Research issues in context-aware retrieval: needs of the Matcher-library
Peter Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk
ABSTRACT
The set of Java classes used by the Context Matcher needs to be made available to
the user (in this case not the end-user but the researcher
who wants to test retrieval strategies)
both within the Matcher itself, in order to allow
the user to plug in specially tailored
scoring algorithms, and in pre- and post-processors
which the user may want to attach to the Matcher.
This paper looks at some sample tasks that the user might want to perform, in order
to show what facilities should ideally be available in the Java classes.
Some basic assumptions
-
Within this document the "user" means the researcher who is writing
pieces of Java code to
investigate retrieval strategies, as distinct
from the "end-user" to whom the retrieved documents are ultimately aimed.
(In most cases the underlying data, e.g. the current contexts of the end-user, will
be simulated.)
-
We assume a Context Diary (previously called the history archive) is available; this is simply a sequence of contexts, each
marked with the time at which that context represented the current context.
The Context Diary need not be uniform: some of its contexts may contain different fields to others.
The Context Diary may also include contexts that relate to the future
instead of to the past: e.g. a context showing a location and time
relating to an end-user's future diary entry, or a context
showing a temperature at some future time as given by a weather forecast.
We need to decide, as a practical matter, whether the
Context Diary is implemented as a sequential file or as a database;
if the former, we need to decide whether it runs forwards or backwards,
i.e. earliest time first or latest time first.
See the discussion below on the database issue.
-
The Context Diary may get very large and in such cases the user (i.e. the researcher)
will almost certainly want to restrict it.
Sample restrictions might be: (a) only contexts
within the range [-T1, T2] of the current time, meaning contexts that are not
more than T1 minutes before the current time, and not more than T2 minutes
after; and/or (b) the N contexts that are closest to the current time.
We call this restricted Context Diary the relevant diary.
For efficiency reasons it will doubtless be best to extract
at load time the relevant diary from the full Context Diary.
(Detailed points:
(1) in practice the times T1 and T2 above are rather dependent on the nature of fields, since an hour might be a long time in relation to location change, but a short time
in relation to temperature change; however in the short term we can live with global
values for T1 and T2 to cover all fields; (2) we need to fix how T1 and
T2 are passed to the library.)
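The two sample restrictions can be sketched as follows. This is only an illustration under stated assumptions: the DiaryEntry record and the RelevantDiary class are hypothetical names, not part of any existing library, times are held as minutes, and T1 and T2 are passed as plain arguments (which is just one of the options left open above).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical diary record: a time in minutes plus the context in some opaque form. */
record DiaryEntry(long timeMinutes, String content) {}

/** Hypothetical helper that extracts a relevant diary from the full Context Diary. */
final class RelevantDiary {
    private RelevantDiary() {}

    /** Restriction (a): keeps entries whose time lies in [now - t1, now + t2]. */
    static List<DiaryEntry> withinWindow(List<DiaryEntry> diary, long now, long t1, long t2) {
        List<DiaryEntry> out = new ArrayList<>();
        for (DiaryEntry e : diary) {
            if (e.timeMinutes() >= now - t1 && e.timeMinutes() <= now + t2) {
                out.add(e);
            }
        }
        return out;
    }

    /** Restriction (b): keeps the n entries closest in time to now, returned in time order. */
    static List<DiaryEntry> closest(List<DiaryEntry> diary, long now, int n) {
        List<DiaryEntry> sorted = new ArrayList<>(diary);
        // Sort by distance from the current time, nearest first.
        sorted.sort(Comparator.comparingLong((DiaryEntry e) -> Math.abs(e.timeMinutes() - now)));
        int keep = Math.min(Math.max(n, 0), sorted.size());
        List<DiaryEntry> out = new ArrayList<>(sorted.subList(0, keep));
        // Restore chronological order for the caller.
        out.sort(Comparator.comparingLong(DiaryEntry::timeMinutes));
        return out;
    }
}
```

The two restrictions compose: applying withinWindow first and closest second gives the "and" case of (a) and (b).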
-
We will use the term Matcher-library to describe the set of Java classes
used by the Matcher, but also available to the pre- and post-processors and to any
other program that wants these classes.
In this document we will just use the term "library" to denote the Matcher-library.
(It is a pragmatic matter as to whether it is really a Java library or the equivalent of an "include" file.)
-
The overall architecture is a pipeline.
However for efficiency reasons we might cheat a bit and convey
information between pipeline components by means of files stored in
an internal form; this might save a lot of re-parsing and
re-creation of source formats.
-
The pre-processors, etc., that we describe below know what the fields are, e.g. they are written specifically
for an application which is known to have fields Temperature, Body,
Title, etc.
There may in the future be a need for more general pre-processors that
discover what the fields are (by using the library to interrogate the document collection?),
but we do not cover those here.
Perhaps this is not a big problem anyway as the information may be naturally available when
using the library.
Some examples
The following examples show some of the programs the user might want to write.
The purpose of showing these is to identify the likely demands on the library.
Example 1: updating the Context Diary
- Description:
-
We assume contexts are added one at a time to the Context Diary.
This will be done: (a) to record the current context at various intervals;
(b) to introduce future contexts, derived, for instance, from
weather forecasts.
In case (a) the current context presented as a record
to be added to the Context Diary may be
a subset of the true current context: for example the application
may choose to record only those fields set by sensors.
In addition there needs to be a way of deleting individual Context Diary records,
for instance when a future diary event is cancelled, or is detected not
to really have occurred.
-
Inputs:
-
(i) The Context Diary, plus (ii) the context to be added to the Diary.
We assume the same information is supplied in the deletion case too, though an alternative is
to supply a UniqueID for the record to be deleted.
-
Algorithms used and demands on library:
-
It is desirable to perform a syntax check of input (ii), which we assume is in
source SGML form.
We assume the library provides a method to do this: it should
return a (possibly null) set of error messages.
In the deletion case there needs to be a check that the record is really
there (is this just a matching operation plus a check that there is exactly
one output?).
If the Context Diary is a database, updating should be trivial.
- Outputs:
-
The only explicit outputs are the possible error messages.
The overall effect is to update the Context Diary.
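A minimal sketch of the update logic, under stated assumptions, is given here. The DiaryUpdater class is hypothetical. The syntax checker is passed in as a function because we do not know the form of the library's real method; all we assume is that it returns a (possibly empty) list of error messages. Records are held in source form as strings, and the "record is really there" check is done by exact equality rather than by a true matching operation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical updater for a Context Diary held as a list of source-form records. */
final class DiaryUpdater {
    private final List<String> records = new ArrayList<>();
    // Stand-in for the library's syntax check: returns error messages, empty if the input is valid.
    private final Function<String, List<String>> syntaxChecker;

    DiaryUpdater(Function<String, List<String>> syntaxChecker) {
        this.syntaxChecker = syntaxChecker;
    }

    /** Adds the record if it passes the syntax check; returns the error messages (empty on success). */
    List<String> add(String sourceForm) {
        List<String> errors = syntaxChecker.apply(sourceForm);
        if (errors.isEmpty()) {
            records.add(sourceForm);
        }
        return errors;
    }

    /** Deletes the record only if exactly one stored record equals it. */
    List<String> delete(String sourceForm) {
        List<String> errors = new ArrayList<>(syntaxChecker.apply(sourceForm));
        if (!errors.isEmpty()) {
            return errors;
        }
        long matches = records.stream().filter(sourceForm::equals).count();
        if (matches != 1) {
            errors.add("expected exactly one matching record, found " + matches);
            return errors;
        }
        records.remove(sourceForm);
        return errors;
    }

    int size() {
        return records.size();
    }
}
```

In both cases the only explicit output is the list of error messages, as described above; the Diary itself is changed as a side effect.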
Example 2: a pre-processor to massage the current context on the basis of history
- Description:
-
This pre-processor takes a current context and massages it into a new current context based
on anticipated needs, e.g. by pretending the end-user is in advance of their true
current context.
For instance the pre-processor might look at the locations in the relevant
diary in the past and adjust the current value accordingly to be a bit ahead
of the true position.
It may also take account of future locations recorded in the diary.
-
Inputs:
-
The current context and the Context Diary -- or more particularly
the relevant diary.
- Algorithms used and demands on library:
-
-
Extract the relevant diary from the Context Diary (does the library provide help for this?).
-
For each field in the current context, apply the following procedure.
Take each context in the relevant diary and see if the current field
matches this in name, label (if implemented) and data type.
(This should be a call of a "field matching" method in the library;
actually it may be useful to match the value too, since a diary
value that was a complete mismatch to the current value could be interpreted to mean that diary
was irrelevant and should be ignored.)
If there is a match, remember the diary value and the associated time.
The result from this stage is a vector of diary times with associated values.
-
For each field do some ad hoc calculations using the current context and the vector
derived above.
The calculations may involve numeric or string manipulations depending on the data type
of the field we are looking at.
(For example for a Location field the calculations
may try to guess the direction of movement of the end-user.)
The library should provide a method for asking the data type of the field
(e.g. as in the Range class of the old Context Matcher).
The output from the calculations is a new value for the current field.
-
Outputs:
-
The output is a copy of the current context supplied as input, but with some of the
values changed.
This is then fed into the next stage of the pipeline as the current context.
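The procedure can be sketched as follows for a single numeric field. Everything here is an assumption made for illustration: the TimedContext and Sample records and the ContextMassager class are hypothetical, fields are matched by name only (not by label or data type), and the "ad hoc calculation" is a simple linear projection from the most recent past value. A real Location field would need at least two dimensions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical context: a time plus numeric field values keyed by field name. */
record TimedContext(double time, Map<String, Double> fields) {}

/** One remembered diary value for a field, with its associated time. */
record Sample(double time, double value) {}

/** Hypothetical pre-processor helper. */
final class ContextMassager {
    private ContextMassager() {}

    /** Builds the vector of diary times and values for the named field. */
    static List<Sample> samplesFor(String field, List<TimedContext> relevantDiary) {
        List<Sample> out = new ArrayList<>();
        for (TimedContext c : relevantDiary) {
            Double v = c.fields().get(field);
            if (v != null) {
                out.add(new Sample(c.time(), v));
            }
        }
        return out;
    }

    /**
     * Projects the current value forward by lookAhead time units, using the rate of
     * change since the most recent diary sample that precedes the current time.
     */
    static double lookAhead(List<Sample> samples, double now, double current, double lookAhead) {
        Sample latest = null;
        for (Sample s : samples) {
            if (s.time() < now && (latest == null || s.time() > latest.time())) {
                latest = s;
            }
        }
        if (latest == null) {
            return current; // no usable history: leave the value unchanged
        }
        double rate = (current - latest.value()) / (now - latest.time());
        return current + rate * lookAhead;
    }
}
```

For instance, if the field was 10 at time 5 and is 20 at time 10, the rate is 2 per time unit, so looking 3 units ahead gives 26.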
Example 3: a pre-processor to set field weights
- Description:
-
This pre-processor looks at the diary of fields (usually just the past
diary, i.e. T2 above is zero) in order to set their field weights.
For instance static fields may be given low weights and rapidly changing
fields may be given high weights.
Some fields may be given zero weights, thus making them inactive.
-
Inputs:
- As for previous example.
- Algorithms used and demands on library:
-
This is very similar to the previous example.
A refinement is that this time we may want to perform a full match
of the current field against the corresponding field of a diary entry.
A perfect score represents a perfect match and therefore a static field.
Otherwise the score may be used to detect a rapidly changing field
(detected by a score constantly changing by large margins as we
go back in history, though an almost complete mismatch with history may indicate that history is irrelevant).
The extra demands on the library are just that it must be possible
for the user's Java code to call
a field matching method in the library that returns a field score.
-
Outputs:
-
A set of weights to be used in the retrieval stage of the pipeline.
An example is:
-w location:1.5;temperature:0.3
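One possible weighting rule is sketched here. It rests on assumptions that the text above does not fix: field scores are taken to lie in [0, 1] with 1 meaning a perfect match, the weight is simply the average mismatch scaled by a chosen maximum, and a field with no history is made inactive. The FieldWeights class is hypothetical, and the sketch does not attempt to detect the case where history is irrelevant.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that turns diary match scores into field weights. */
final class FieldWeights {
    private FieldWeights() {}

    /**
     * Weight grows with the average mismatch (1 - score) between the current field
     * and its diary values, so a static field (all scores 1.0) gets weight zero.
     */
    static double weightFor(List<Double> matchScores, double maxWeight) {
        if (matchScores.isEmpty()) {
            return 0.0; // assumption: no history means the field is left inactive
        }
        double mismatch = 0.0;
        for (double s : matchScores) {
            mismatch += 1.0 - s;
        }
        return maxWeight * mismatch / matchScores.size();
    }

    /** Formats the weights in the "-w name:weight;name:weight" form shown above. */
    static String toOption(Map<String, Double> weights) {
        StringJoiner joiner = new StringJoiner(";", "-w ", "");
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            joiner.add(e.getKey() + ":" + String.format(Locale.ROOT, "%.1f", e.getValue()));
        }
        return joiner.toString();
    }
}
```

The order of fields in the output follows the iteration order of the map, so a LinkedHashMap should be used if a fixed order matters.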
Example 4: field scoring methods plugged into the retrieval engine
- Description:
-
The user may wish to plug in scoring methods for individual fields.
-
Inputs:
-
We are not talking about source form inputs here, but arguments supplied
by the Matcher to the user-written Java methods that have been plugged in.
These should be the two fields to be matched.
-
Algorithms used and demands on library:
-
The two arguments to the plug-in will be of class Field.
The user's algorithm will need to look at the values of the fields,
and to find out (a) what data type they are and (b) their initial field score.
The algorithm will then do some ad hoc arithmetic on the values.
It is possible that the algorithm may want to look at the Context Diary -- see earlier
examples.
-
Outputs:
-
These methods will return a field score to the caller (the Matcher).
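A plug-in of this kind might look as follows. This is a sketch only: PluginField is a hypothetical stand-in for the library's Field class (whose real accessors we do not know), FieldScorer is an assumed plug-in interface, and the "numeric" data type name and the tolerance rule are invented for the example.

```java
/** Hypothetical stand-in for the library's Field class. */
record PluginField(String name, String dataType, double value, double initialScore) {}

/** Assumed shape of a field-scoring plug-in: two fields in, one field score out. */
interface FieldScorer {
    double score(PluginField current, PluginField document);
}

/** Example plug-in: scores numeric fields by closeness, falling to zero at the tolerance. */
final class NumericClosenessScorer implements FieldScorer {
    private final double tolerance;

    NumericClosenessScorer(double tolerance) {
        if (tolerance <= 0) {
            throw new IllegalArgumentException("tolerance must be positive");
        }
        this.tolerance = tolerance;
    }

    @Override
    public double score(PluginField current, PluginField document) {
        // For data types this plug-in does not understand, keep the Matcher's initial score.
        if (!"numeric".equals(current.dataType()) || !"numeric".equals(document.dataType())) {
            return document.initialScore();
        }
        double distance = Math.abs(current.value() - document.value());
        return Math.max(0.0, 1.0 - distance / tolerance);
    }
}
```

With a tolerance of 10, a current Temperature of 20 scores 0.5 against a document value of 25 and 0 against a document value of 40.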
Example 5: document scoring method plugged into the retrieval engine
- Description:
-
The user
may also want to plug in a method that combines the field scores of a document
into a document score.
It might need access to the initial document score.
-
Inputs:
-
We are not talking about source form inputs here, but arguments supplied
by the Matcher to the user-written Java methods.
The arguments must provide, for each field of the document that was
involved in the matching, its field name, score and its field weighting.
-
Algorithms used and demands on library:
-
This will be an ad hoc algorithm that makes no special demands on
the library.
-
Outputs:
-
This method will return a document score to the caller.
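As an illustration, a weighted mean is one simple way to combine field scores. The names below (ScoredField, DocumentScorer, WeightedMeanScorer) are hypothetical, and the choice to fall back on the initial document score when no weighted field took part is our own assumption.

```java
import java.util.List;

/** Hypothetical argument type: one matched field with its name, score and weighting. */
record ScoredField(String name, double score, double weight) {}

/** Assumed shape of a document-scoring plug-in. */
interface DocumentScorer {
    double score(List<ScoredField> fields, double initialDocumentScore);
}

/** Example plug-in: the weighted mean of the field scores. */
final class WeightedMeanScorer implements DocumentScorer {
    @Override
    public double score(List<ScoredField> fields, double initialDocumentScore) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (ScoredField f : fields) {
            weighted += f.weight() * f.score();
            totalWeight += f.weight();
        }
        // Assumption: with no active fields, keep the score the Matcher already computed.
        return totalWeight == 0.0 ? initialDocumentScore : weighted / totalWeight;
    }
}
```

Fields given a zero weight drop out of the mean, which matches the idea of zero-weighted fields being inactive.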
Example 6: a post-processor to adjust scores
- Description:
-
This post-processor changes the field scores and document scores in the
retrieved documents.
It may even eliminate some documents (e.g. those relating to
sites the end-user has passed already) by setting their scores to zero.
-
Inputs:
-
A set of retrieved documents, the current context (which might have changed since the original retrieval request was made if retrieval is
slow), and possibly a Context Diary.
The retrieved documents will have scores on the retrieved fields and
on each document; the documents will be ordered in decreasing document
score.
A sample fragment of a document (here shown in source form
though in reality it may be in internal form) might be:
<note score="1.2">
<location score="1.9"> ..
<body> ...
In the above the overall document score is 1.2, and the Location field has a score of 1.9.
The Body field does not have a score, and therefore was not
involved in the matching.
-
Algorithms used and demands on library:
-
One approach is to implement this post-processor as another instance
of the Matcher.
This takes (a) the documents retrieved by the previous Matcher;
(b) a current context (which may be a subset of the true current context, e.g.
it may be just location); and (c) perhaps a Context Diary.
The user encodes their requirement by plug-ins to (re-)set
field scores.
For example the plug-in to match locations may set the score to zero if
the document's location is behind the current location
(deciding this will need access to the Context Diary);
otherwise the initial score may simply be copied over to the output.
An alternative is a specially written post-processor.
The post-processor may use the library to: load the set of previously
retrieved documents; perhaps load a Context Diary and extract the
relevant diary; loop through the documents, extracting a certain field of
interest (e.g. Temperature) if it is present.
The user's code may then calculate a new field value, which it
will feed back to the library code.
The above may be repeated for other fields that are of interest.
At various stages the user's code might ask the library to re-calculate
document scores.
Finally the library will be asked to sort its documents into descending order
of document score, prior to output.
-
Outputs:
-
A set of retrieved documents; usually this will be a subset of the
input set of retrieved documents, but with different scores.
The documents will, in the usual way, be sorted into descending document score.
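The core of the specially written variant can be sketched as follows, assuming that only document scores are adjusted. ScoredDoc and ScoreAdjuster are hypothetical names, the user's requirement is supplied as a re-scoring function, and documents whose new score is zero or less are dropped from the output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical retrieved document: an identifier plus its document score. */
record ScoredDoc(String uniqueId, double score) {}

/** Hypothetical post-processor step: re-score, eliminate, then re-sort. */
final class ScoreAdjuster {
    private ScoreAdjuster() {}

    static List<ScoredDoc> adjust(List<ScoredDoc> docs, ToDoubleFunction<ScoredDoc> rescore) {
        List<ScoredDoc> out = new ArrayList<>();
        for (ScoredDoc d : docs) {
            double s = rescore.applyAsDouble(d);
            if (s > 0.0) { // a score of zero eliminates the document
                out.add(new ScoredDoc(d.uniqueId(), s));
            }
        }
        // Output is in descending order of document score, as for the Matcher.
        out.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return out;
    }
}
```

A re-scoring function that returns zero for sites already passed and the original score otherwise reproduces the elimination case described above.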
Example 7: post-processor to work on a cache
- Description:
-
In many situations it is attractive to perform one retrieval operation
to extract all the documents in the general vicinity of the current context,
and then to use this as a cache for subsequent retrieval.
The cache may last anything from a few seconds to a whole day.
Often the cache will be downloaded into a small hand-held device,
which may then be disconnected from the host for a whole day while the
end-user is out in the field.
In such cases the Matcher on the cache will run on the hand-held
device; it may be a cut down version.
-
Inputs:
-
As for an ordinary retrieval operation with the Matcher, except
now the document collection is a cache extracted by a previous call of
the Matcher.
-
Algorithms used and demands on library:
-
No different from an ordinary Matcher.
-
Outputs:
-
No different from an ordinary Matcher.
Example 8: post-processor to do ordinary IR
- Description:
-
A post-processor might wish to do an ordinary information retrieval
operation to extract the documents of special current interest to the end-user,
e.g. those documents whose Title and/or Body
fields relate to architecture.
-
Inputs:
-
As for an ordinary call of the Matcher.
The current context is likely to be set explicitly by the end-user
rather than by sensors, but this makes no difference.
-
Algorithms used and demands on library:
-
This is a standard use of the Matcher, except that it will run in "reverse"
mode, i.e. interactive rather than proactive, or, to put
this another way, driven by the current context rather than by contexts
associated with the document collection.
-
Outputs:
-
As for standard use of the Matcher.
Example 9: post-processor to merge the output from several retrievals
- Description:
-
It may be desirable to initiate several independent retrieval operations, and then
use a post-processor to merge the results (e.g. to pick out documents that score well
on two independent search criteria, or to analyse what words occur frequently in the
textual fields of two sets of independently-retrieved documents).
Conceivably the retrieval operations will use different document collections, e.g. one
search of the web and one search of a collection of tourism stick-e notes, or one search
of tourism stick-e notes and one search of architecture stick-e notes.
In such cases the post-processor will need to contain some elaborate programming,
but this will largely be independent of the library.
Here we will assume the simpler case where all the retrievals apply to the same document collection.
-
Inputs:
-
Several document collections (where each document collection is a set
of documents that has previously been retrieved from a larger document collection).
(Most other applications have one document collection, but hopefully a multiplicity
does not cause any special problems for the library.)
-
Algorithms used and demands on library:
-
The post-processor will need to detect if the same document occurs in two
separate collections.
This implies that every document in a document collection should have a UniqueID field that is
automatically attached to it (added on loading, or embedded in the document collection?),
and that this field can be part of the retrieved information if required.
(Such a field would also be useful to an application that wanted to record what
documents had already been shown to the end-user.)
Otherwise this post-processor does not appear to make any special extra demands on the library.
-
Outputs:
-
Probably as for an ordinary Matcher; the documents that are output will usually be a subset
of those input, and will have changed scores.
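For the simple case of two retrievals over the same document collection, the merge might be sketched like this. The Hit record and ResultMerger class are hypothetical, documents are identified purely by their UniqueID, and taking the smaller of the two scores is just one plausible reading of "scores well on two independent search criteria".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical retrieved document, reduced to its UniqueID and document score. */
record Hit(String uniqueId, double score) {}

/** Hypothetical merging post-processor. */
final class ResultMerger {
    private ResultMerger() {}

    /** Keeps documents present in both result sets, scoring each by the smaller of its two scores. */
    static List<Hit> inBoth(List<Hit> first, List<Hit> second) {
        Map<String, Double> firstScores = new HashMap<>();
        for (Hit h : first) {
            firstScores.put(h.uniqueId(), h.score());
        }
        List<Hit> merged = new ArrayList<>();
        for (Hit h : second) {
            Double other = firstScores.get(h.uniqueId());
            if (other != null) {
                merged.add(new Hit(h.uniqueId(), Math.min(other, h.score())));
            }
        }
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged;
    }
}
```

The output is a subset of the inputs with changed scores, sorted in descending order, which is the form suggested for the outputs above.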
Example 10: final module that presents information to the end-user
- Description:
-
This module will present retrieved documents to the end-user.
The exact way it does this (e.g. placing a dot as a hot-spot on a map, maintaining an
interface similar to e-mail) is not of concern here.
Often the module will just extract certain fields from the retrieved documents.
It may wish to maintain records of which documents the end-user has seen.
(As a special feature, if a new version of a previous document is retrieved, e.g. an
updated traffic report, this may have special significance; however such details
are not relevant to our current research programme, and we can ignore them for the present.)
In addition the module may wish to provide adaptive feedback concerning the relevance to the
end-user of the documents delivered.
-
Inputs:
-
A set of retrieved documents.
-
Algorithms used and demands on library:
-
This module will require the usual parsing and field extraction facilities.
It also requires that a document has a UniqueID field to identify it.
In particular this may be used for adaptive feedback on the relevance of delivered documents:
there may be a table of UniqueIDs with a feedback score on each.
Such a table might be used by another pre- or post-processor (not explicitly on
the list described here).
-
Outputs:
-
The output will usually be an object on the end-user's screen.
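The table of UniqueIDs with feedback scores could be as simple as the following sketch. The FeedbackTable class and its method names are hypothetical, and accumulating feedback by addition is an assumption; other update rules are equally possible.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical record of what has been shown, with a feedback score per UniqueID. */
final class FeedbackTable {
    private final Set<String> seen = new HashSet<>();
    private final Map<String, Double> feedback = new HashMap<>();

    /** Records that a document was shown; returns true if this is the first time. */
    boolean markShown(String uniqueId) {
        return seen.add(uniqueId);
    }

    /** Accumulates a relevance judgement, positive or negative, for a document. */
    void addFeedback(String uniqueId, double delta) {
        feedback.merge(uniqueId, delta, Double::sum);
    }

    /** Returns the accumulated feedback score, or zero if none has been recorded. */
    double feedbackFor(String uniqueId) {
        return feedback.getOrDefault(uniqueId, 0.0);
    }
}
```

Another pre- or post-processor could read feedbackFor to raise or lower scores for documents the end-user has already judged.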
Implementing the Context Diary
The operations that need to be performed on the Context Diary are:
-
adding a context.
-
deleting a context.
-
matching against the current context, for example the user might want to
find the Diary context that most closely matches their
current context (the matching will usually be limited
to certain fields of the current context).
-
extracting Diary information in general;
this will also often involve matching operations, e.g.
to find the Diary record for the time closest to a certain
designated time.
Interestingly these are almost identical to the requirements for
manipulating the document collection.
Hence I propose that, in the short term at least, the Context Diary takes
an identical format to the document collection.
[[The alternative is to use a database to store the Context Diary (even though the
document collection probably will not use a database).
The main arguments for this are ease of use and efficiency.
Even though the operations on the Context Diary may be similar
to those on the document collection, their relative frequency will
be much different: for example updating of the Diary is
likely to be very frequent.
A database is likely to be geared to working efficiently when there
are frequent updates.
A database may, however, have problems if the Diary records are
highly non-uniform, with different records having different fields.]]
With this assumption, it may be easiest for the user to
implement Example 1 as an instance of the Matcher, especially
if all that is required is additions to the Diary rather than deletions.
The Matcher will take as inputs the Context Diary (this will be
supplied as if it were the document collection), and the current
context to be added.
The updating algorithm will be supplied as a plug-in to the Matcher.
The plug-in will add a time field to the current context, if not
already present, in order to
record the current time, and then add the augmented current context to the document collection
(a fairly trivial operation irrespective of whether the
document collection is a Java Vector or a linked list); finally it will
call a library method to save the document collection (presumably
in internal form).
The Context Diary will be ordered by time, i.e. the value in the
Time field -- this might be the internal Unix time, i.e. seconds
since 1 January 1970.
It is the responsibility of the user, not the library, to maintain
this ordering.
Thus when adding to the Context Diary, the user is responsible for
adding it in the right place, i.e. their plug-in should do this.
(Incidentally this plug-in raises the need for the Matcher to provide a slot: "call this plug-in just before matching of fields starts".)
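The ordering responsibility placed on the user's plug-in might be met as sketched here. The TimedRecord record and OrderedDiary class are hypothetical, and we assume the forwards ordering (earliest time first), which is one of the two options left open earlier. A binary search finds the insertion point, so a record arriving out of order (e.g. a future context from a weather forecast) still lands in the right place.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diary record: a Time field in seconds plus the context in some opaque form. */
record TimedRecord(long timeSeconds, String content) {}

/** Hypothetical time-ordered Context Diary, earliest record first. */
final class OrderedDiary {
    private final List<TimedRecord> records = new ArrayList<>();

    /** Inserts the record so that the list stays in ascending time order. */
    void add(TimedRecord r) {
        int lo = 0;
        int hi = records.size();
        // Binary search for the first position whose time is strictly later than r's,
        // so that records with equal times keep their order of arrival.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (records.get(mid).timeSeconds() <= r.timeSeconds()) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        records.add(lo, r);
    }

    /** Returns an unmodifiable snapshot of the diary in time order. */
    List<TimedRecord> records() {
        return List.copyOf(records);
    }
}
```

With an array-backed list the search is fast but the insertion itself may shift later records; for a diary that mostly grows at the end this cost should be small.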