Research issues in context-aware retrieval: needs of the Matcher-library

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

The set of Java classes used by the Context Matcher needs to be made available to the user (in this case not the end-user but the researcher who wants to test retrieval strategies) both within the Matcher itself, in order to allow the user to plug in specially tailored scoring algorithms, and in pre- and post-processors which the user may want to attach to the Matcher. This paper looks at some sample tasks that the user might want to perform, in order to show what facilities should ideally be available in the Java classes.

Some basic assumptions

  1. Within this document the "user" means the researcher who is writing pieces of Java code to investigate retrieval strategies, as distinct from the "end-user" to whom the retrieved documents are ultimately aimed. (In most cases the underlying data, e.g. the current contexts of the end-user, will be simulated.)
  2. We assume a Context Diary (previously called the history archive) is available; this is simply a sequence of contexts, each marked with the time at which that context represented the current context. The Context Diary need not be uniform: some of its contexts may contain different fields from others. The Context Diary may also include contexts that relate to the future rather than the past: e.g. a context showing a location and time taken from an end-user's future diary entry, or a context showing a temperature at some future time as given by a weather forecast. We need to decide as a practical matter whether the Context Diary is implemented as a sequential file or as a database; if the former, we also need to decide whether it runs backwards or forwards, i.e. earliest time first or latest time first. See the discussion below on the database issue.
  3. The Context Diary may get very large, and in such cases the user (i.e. the researcher) will almost certainly want to restrict it. Sample restrictions might be: (a) only contexts within the range [-T1, T2] of the current time, meaning contexts that are not more than T1 minutes before the current time and not more than T2 minutes after; and/or (b) the N contexts that are closest to the current time. We call this restricted Context Diary the relevant diary; a sketch of both restrictions is given after this list. For efficiency reasons it will doubtless be best to extract the relevant diary from the full Context Diary at load time. (Detailed points: (1) in practice the times T1 and T2 above depend rather on the nature of the fields, since an hour might be a long time in relation to location change but a short time in relation to temperature change; however, in the short term we can live with global values of T1 and T2 covering all fields; (2) we need to fix how T1 and T2 are passed to the library.)
  4. We will use the term Matcher-library to describe the set of Java classes used by the Matcher, but also available to the pre- and post-processors and to any other program that wants these classes. In this document we will just use the term "library" to denote the Matcher-library. (It is a pragmatic matter as to whether it is really a Java library or the equivalent of an "include" file.)
  5. The overall architecture is a pipeline. However for efficiency reasons we might cheat a bit and convey information between pipeline components by means of files stored in an internal form; this might save a lot of re-parsing and re-creation of source formats.
  6. The pre-processors, etc., that we describe below know what the fields are, e.g. they are written specifically for an application that is known to have fields Temperature, Body, Title, etc. There may, in the future, be a need for more general pre-processors that discover what the fields are (by using the library to interrogate the document collection?), but we do not cover those here. Perhaps this is not a big problem anyway, as the information may be naturally available when using the library.
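To make assumptions 2 and 3 concrete, the following sketch models a Context Diary entry as a time-stamped set of fields and extracts the relevant diary under both sample restrictions. Everything here is hypothetical: the class names (DiaryEntry, RelevantDiary) and the choice of seconds-since-the-epoch times are illustrative assumptions, not part of the library.
  import java.util.*;

  // Hypothetical sketch only: the real library may represent diary entries quite differently.
  class DiaryEntry {
      long time;                      // assumed to be seconds since the Unix epoch
      Map<String, String> fields;     // field name -> value in source form
      DiaryEntry(long time, Map<String, String> fields) {
          this.time = time;
          this.fields = fields;
      }
  }

  class RelevantDiary {
      // Restriction (a): keep only entries within [now - T1, now + T2], with T1 and T2 in minutes.
      static List<DiaryEntry> window(List<DiaryEntry> diary, long now, long t1Minutes, long t2Minutes) {
          List<DiaryEntry> out = new ArrayList<>();
          for (DiaryEntry e : diary) {
              if (e.time >= now - t1Minutes * 60 && e.time <= now + t2Minutes * 60) {
                  out.add(e);
              }
          }
          return out;
      }

      // Restriction (b): keep the N entries whose times are closest to the current time.
      static List<DiaryEntry> closest(List<DiaryEntry> diary, long now, int n) {
          List<DiaryEntry> copy = new ArrayList<>(diary);
          copy.sort(Comparator.comparingLong((DiaryEntry e) -> Math.abs(e.time - now)));
          return copy.subList(0, Math.min(n, copy.size()));
      }
  }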

Some examples

The following examples show some of the programs the user might want to write. The purpose of showing these is to identify the likely demands on the library.

Example 1: updating the Context Diary

Description:
We assume contexts are added one at a time to the Context Diary. This will be done: (a) to record the current context at various intervals; (b) to introduce future contexts, derived, for instance, from weather forecasts. In case (a) the current context presented as a record to be added to the Context Diary may be a subset of the true current context: for example the application may choose to record only those fields set by sensors. In addition there needs to be a way of deleting individual Context Diary records, for instance when a future diary event is cancelled, or is detected not to really have occurred.
Inputs:
(i) The Context Diary, plus (ii) the context to be added to the Diary. We assume the same information is supplied in the deletion case too, though an alternative is to supply a UniqueID for the document to be deleted.
Algorithms used and demands on library:
It is desirable to perform a syntax check of input (ii), which we assume is in source SGML form. We assume the library provides a method to do this: it should return a (possibly empty) set of error messages. In the deletion case there needs to be a check that the record is really there (is this just a matching operation plus a check that there is exactly one output?). If the Context Diary is a database, updating should be trivial.
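A minimal sketch of this updating program, under the assumption that the Diary is held as a sequential list of contexts in source form; the class name DiaryUpdater and the checkSyntax placeholder are inventions standing in for whatever syntax-checking method the library actually provides.
  import java.util.*;

  // Hypothetical sketch of Example 1; checkSyntax is a placeholder for the library's SGML check.
  class DiaryUpdater {
      private final List<String> diary = new ArrayList<>();    // contexts in source (SGML) form

      List<String> add(String contextSource) {
          List<String> errors = checkSyntax(contextSource);     // assumed library call
          if (errors.isEmpty()) {
              diary.add(contextSource);                          // real code would insert in time order
          }
          return errors;                                         // the only explicit output
      }

      boolean delete(String contextSource) {
          // Deletion as "match and remove": succeeds only if the record is really there.
          return diary.remove(contextSource);
      }

      private List<String> checkSyntax(String source) {
          // Crude placeholder; the library is expected to provide the real check.
          return source.trim().startsWith("<") ? Collections.emptyList()
                                               : List.of("context is not well-formed SGML");
      }
  }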
Outputs:
The only explicit outputs are the possible error messages. The overall effect is to update the Context Diary.

Example 2: a pre-processor to massage the current context on the basis of history

Description:
This pre-processor takes a current context and massages it into a new current context based on anticipated needs, e.g. by pretending the end-user is in advance of their true current context. For instance the pre-processor might look at the locations in the relevant diary in the past and adjust the current value accordingly to be a bit ahead of the true position. It may also take account of locations in the future history.
Inputs:
The current context and the Context Diary -- or more particularly the relevant diary.
Algorithms used and demands on library:
  1. Extract the relevant diary from the Context Diary (does the library provide help for this?).
  2. For each field in the current context, apply the following procedure. Take each context in the relevant diary and see whether the current field matches it in name, label (if implemented) and data type. (This should be a call of a "field matching" method in the library; it may actually be useful to match the value too, since a diary value that is a complete mismatch to the current value could be interpreted to mean that the diary is irrelevant and should be ignored.) If there is a match, remember the diary value and the associated time. The result from this stage is a vector of diary times with associated values.
  3. For each field do some ad hoc calculations using the current context and the vector derived above. The calculations may involve numeric or string manipulations depending on the data type of the field we are looking at. (For example for a Location field the calculations may try to guess the direction of movement of the end-user.) The library should provide a method for asking the data type of the field (e.g. as in the Range class of the old Context Matcher). The output from the calculations is a new value for the current field.
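A sketch of stages 2 and 3 for a single numeric field, reusing the hypothetical DiaryEntry class from the earlier sketch. The crude map lookup stands in for the library's field-matching method, and the linear extrapolation and 60-second look-ahead are arbitrary illustrative choices.
  import java.util.*;

  // Hypothetical sketch of Example 2 for one numeric field (e.g. a coordinate).
  class ContextMassager {
      static double massage(String fieldName, double currentValue,
                            List<DiaryEntry> relevantDiary) {
          // Stage 2: collect the diary times and values of the matching field.
          TreeMap<Long, Double> history = new TreeMap<>();
          for (DiaryEntry e : relevantDiary) {
              String v = e.fields.get(fieldName);          // stand-in for the field-matching call
              if (v != null) history.put(e.time, Double.parseDouble(v));
          }
          if (history.size() < 2) return currentValue;     // not enough history to extrapolate

          // Stage 3: ad hoc calculation -- estimate the recent rate of change and
          // push the current value a little ahead of the true position.
          Map.Entry<Long, Double> last = history.lastEntry();
          Map.Entry<Long, Double> prev = history.lowerEntry(last.getKey());
          double ratePerSecond = (last.getValue() - prev.getValue()) / (last.getKey() - prev.getKey());
          long lookAheadSeconds = 60;                      // arbitrary notion of "a bit ahead"
          return currentValue + ratePerSecond * lookAheadSeconds;
      }
  }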
Outputs:
The output is a copy of the current context supplied as input, but with some of the values changed. This is then fed into the next stage of the pipeline as the current context.

Example 3: a pre-processor to set field weights

Description:
This pre-processor looks at the history of fields in the diary (usually just the past diary, i.e. T2 above is zero) in order to set their field weights. For instance, static fields may be given low weights and rapidly changing fields high weights. Some fields may be given zero weights, thus making them inactive.
Inputs:
As for previous example.
Algorithms used and demands on library:
This is very similar to the previous example. A refinement is that this time we may want to perform a full match of the current field against the corresponding field of a diary entry. A perfect score represents a perfect match and therefore a static field. Otherwise the score may be used to detect a rapidly changing field (indicated by a score that changes by large margins as we go back in history; an almost complete mismatch with history may instead indicate that history is irrelevant). The extra demand on the library is just that it must be possible for the user's Java code to call a field-matching method in the library that returns a field score.
Outputs:
A set of weights to be used in the retrieval stage of the pipeline. An example is:
-w location:1.5;temperature:0.3
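As an illustration, the sketch below turns per-field histories of match scores into weights and emits them in the -w form shown above. The thresholds and the particular weights are arbitrary, and the scoresPerField argument is assumed to have been built by calling the library's field-matching method over the relevant diary.
  import java.util.*;

  // Hypothetical sketch of Example 3: weights derived from how much the field score
  // varies as we go back through the relevant diary.
  class WeightSetter {
      static String weightsOption(Map<String, List<Double>> scoresPerField) {
          StringJoiner joiner = new StringJoiner(";", "-w ", "");
          for (Map.Entry<String, List<Double>> entry : scoresPerField.entrySet()) {
              List<Double> scores = entry.getValue();
              double weight;
              if (scores.isEmpty()) {
                  weight = 0.0;                               // no history: make the field inactive
              } else {
                  double spread = Collections.max(scores) - Collections.min(scores);
                  weight = spread < 0.05 ? 0.3 : 1.5;         // static field: low weight; changing: high
              }
              joiner.add(entry.getKey() + ":" + weight);
          }
          return joiner.toString();                           // e.g. "-w location:1.5;temperature:0.3"
      }
  }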

Example 4: field scoring methods plugged into the retrieval engine

Description:
The user may wish to plug in scoring methods for individual fields.
Inputs:
We are not talking about source form inputs here, but arguments supplied by the Matcher to the user-written Java methods that have been plugged in. These should be the two fields to be matched.
Algorithms used and demands on library:
The two arguments to the plug-in will be of class Field. The user's algorithm will need to look at the values of the fields, and to find out (a) what data type they are and (b) their initial field score. The algorithm will then do some ad hoc arithmetic on the values. It is possible that the algorithm may want to look at the Context Diary -- see earlier examples.
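A sketch of what such a plug-in might look like. The document only says that the two arguments are of class Field, so the stub Field below, its members, and the FieldScorer interface are all assumptions made for the sake of a compilable example.
  // Hypothetical sketch of Example 4; the Field stub stands in for the library's Field class.
  class Field {
      enum Type { NUMERIC, STRING }
      Type type;
      double numericValue;
      double initialScore;
      Field(Type type, double numericValue, double initialScore) {
          this.type = type; this.numericValue = numericValue; this.initialScore = initialScore;
      }
  }

  interface FieldScorer {
      double score(Field current, Field document);            // field score returned to the Matcher
  }

  class NearnessScorer implements FieldScorer {
      public double score(Field current, Field document) {
          if (current.type != Field.Type.NUMERIC) {
              return document.initialScore;                    // fall back on the initial field score
          }
          // Ad hoc arithmetic: the score decays with the distance between the two values.
          double distance = Math.abs(current.numericValue - document.numericValue);
          return 1.0 / (1.0 + distance);
      }
  }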
Outputs:
These methods will return a field score to the caller (the Matcher).

Example 5: document scoring method plugged into the retrieval engine

Description:
The user may also want to plug in a method that combines the field scores of a document into a document score. It might need access to the initial document score.
Inputs:
We are not talking about source form inputs here, but arguments supplied by the Matcher to the user-written Java methods. The arguments must provide, for each field of the document that was involved in the matching, its field name, score and its field weighting.
Algorithms used and demands on library:
This will be an ad hoc algorithm that makes no special demands on the library.
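For concreteness, a sketch of such a plug-in, assuming the Matcher passes the field scores and field weights as maps keyed by field name; that argument shape is an assumption, not the Matcher's defined interface.
  import java.util.*;

  // Hypothetical sketch of Example 5: a weighted mean of the field scores.
  class DocumentScorer {
      static double combine(Map<String, Double> fieldScores, Map<String, Double> fieldWeights) {
          double weightedSum = 0.0, totalWeight = 0.0;
          for (Map.Entry<String, Double> e : fieldScores.entrySet()) {
              double weight = fieldWeights.getOrDefault(e.getKey(), 1.0);  // default weight 1.0
              weightedSum += weight * e.getValue();
              totalWeight += weight;
          }
          return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;     // the document score
      }
  }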
Outputs:
This method will return a document score to the caller.

Example 6: a post-processor to adjust scores

Description:
This post-processor changes the field scores and document scores in the retrieved documents. It may even eliminate some documents (e.g. those relating to sites the user has passed already) by setting their scores to zero.
Inputs:
A set of retrieved documents, the current context (which might have changed since the original retrieval request was made if retrieval is slow), and possibly a Context Diary. The retrieved documents will have scores on the retrieved fields and on each document; the documents will be ordered in decreasing document score. A sample fragment of a document (here shown in source form though in reality it may be in internal form) might be:
<note score="1.2">
  <location score="1.9"> ..
  <body> ...
In the above the overall document score is 1.2, and the Location field has a score of 1.9. The Body field does not have a score, and therefore was not involved in the matching.
Algorithms used and demands on library:
One approach is to implement this post-processor as another instance of the Matcher. This takes (a) the documents retrieved by the previous Matcher; (b) a current context (which may be a subset of the true current context, e.g. it may be just location); and (c) perhaps a Context Diary. The user encodes their requirement by plug-ins to (re-)set field scores. For example the plug-in to match locations may set the score to zero if the document's location is behind the current location (deciding this will need access to the Context Diary); otherwise the initial score may simply be copied over to the output.

An alternative is a specially written post-processor. The post-processor may use the library to: load the set of previously retrieved documents; perhaps load a Context Diary and extract the relevant diary; loop through the documents, extracting a certain field of interest (e.g. Temperature) if it is present. The user's code may then calculate a new field value, which it will feed back to the library code. The above may be repeated for other fields that are of interest. At various stages the user's code might ask the library to re-calculate document scores. Finally the library will be asked to sort its documents into descending order of document score, prior to output.
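A sketch of the second alternative, assuming retrieved documents are available in a simple internal form; the Document shape below is invented, and the one-dimensional "behind the current position" test stands in for a calculation that would in practice consult the Context Diary.
  import java.util.*;

  // Hypothetical sketch of Example 6: zero the scores of documents already passed, then re-sort.
  class ScoreAdjuster {
      static class Document {
          double score;
          Map<String, Double> fieldScores = new HashMap<>();
          Map<String, String> fieldValues = new HashMap<>();
      }

      static List<Document> adjust(List<Document> retrieved, double currentPosition) {
          for (Document d : retrieved) {
              String location = d.fieldValues.get("location");
              if (location != null && Double.parseDouble(location) < currentPosition) {
                  d.score = 0.0;                              // eliminate: site is behind the end-user
                  d.fieldScores.put("location", 0.0);
              }
          }
          // Sort into descending order of document score before output.
          retrieved.sort(Comparator.comparingDouble((Document d) -> d.score).reversed());
          return retrieved;
      }
  }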

Outputs:
A set of retrieved documents; usually this will be a subset of the input set of retrieved documents, but with different scores. The documents will, in the usual way, be sorted into descending document score.

Example 7: post-processor to work on a cache

Description:
In many situations it is attractive to perform one retrieval operation to extract all the documents in the general vicinity of the current context, and then to use this as a cache for subsequent retrieval. The cache may last anything from a few seconds to a whole day. Often the cache will be downloaded into a small hand-held device, which may then be disconnected from the host for a whole day while the end-user is out in the field. In such cases the Matcher on the cache will run on the hand-held device; it may be a cut down version.
Inputs:
As for an ordinary retrieval operation with the Matcher, except now the document collection is a cache extracted by a previous call of the Matcher.
Algorithms used and demands on library:
No different from an ordinary Matcher.
Outputs:
No different from an ordinary Matcher.

Example 8: post-processor to do ordinary IR

Description:
A post-processor might wish to do an ordinary information retrieval operation to extract the documents of special current interest to the end-user, e.g. those documents whose Title and/or Body fields relate to architecture.
Inputs:
As for an ordinary call of the Matcher. The current context is likely to be set explicitly by the end-user rather than by sensors, but this makes no difference.
Algorithms used and demands on library:
This is a standard use of the Matcher, except that it will run in "reverse" mode, i.e. interactive rather than proactive, or, to put this another way, driven by the current context rather than by contexts associated with the document collection.
Outputs:
As for standard use of the Matcher.

Example 9: post-processor to merge the output from several retrievals

Description:
It may be desirable to initiate several independent retrieval operations, and then use a post-processor to merge the results (e.g. to pick out documents that score well on two independent search criteria, or to analyse what words occur frequently in the textual fields of two sets of independently-retrieved documents). Conceivably the retrieval operations will use different document collections, e.g. one search of the web and one search of a collection of tourism stick-e notes, or one search of tourism stick-e notes and one search of architecture stick-e notes. In such cases the post-processor will need to contain some elaborate programming, but this will largely be independent of the library. Here we will assume the simpler case where all the retrievals apply to the same document collection.
Inputs:
Several document collections (where each document collection is a set of documents that has previously been retrieved from a larger document collection). (Most other applications have one document collection, but hopefully a multiplicity does not cause any special problems for the library.)
Algorithms used and demands on library:
The post-processor will need to detect if the same document occurs in two separate collections. This implies that every document in a document collection should have a UniqueID field that is automatically attached to it (?added on loading or is it embedded in the document collection?), and that this field can be part of the retrieved information if required. (Such a field would also be useful to an application that wanted to record what documents had already been shown to the user.) Otherwise this post-processor does not appear to make any special extra demands on the library.
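A sketch of the merging step, assuming each retrieval result can be reduced to a map from UniqueID to document score; the intersection-and-average combination is just one illustrative policy.
  import java.util.*;

  // Hypothetical sketch of Example 9: keep documents retrieved by both runs, keyed on UniqueID.
  class RetrievalMerger {
      static Map<String, Double> intersect(Map<String, Double> firstRun, Map<String, Double> secondRun) {
          Map<String, Double> merged = new HashMap<>();
          for (Map.Entry<String, Double> e : firstRun.entrySet()) {
              Double other = secondRun.get(e.getKey());
              if (other != null) {
                  merged.put(e.getKey(), (e.getValue() + other) / 2.0);   // combine the two scores
              }
          }
          return merged;
      }
  }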
Outputs:
Probably as for an ordinary Matcher; the documents that are output will usually be a subset of those input, and will have changed scores.

Example 10: final module that presents information to the end-user

Description:
This module will present retrieved documents to the user. The exact way it does this (e.g. placing a dot as a hot-spot on a map, maintaining an interface similar to e-mail) is not of concern here. Often the module will just extract certain fields from the retrieved documents. It may wish to maintain records of which documents the user has seen. (As a special feature, if a new version of a previous document is retrieved, e.g. an updated traffic report, this may have special significance; however such details are not relevant to our current research programme, and we can ignore them for the present.) In addition the module may wish to provide adaptive feedback concerning the relevance to the user of the documents delivered.
Inputs:
A set of retrieved documents.
Algorithms used and demands on library:
This module will require the usual parsing and field extraction facilities. It also requires that a document has a UniqueID field to identify it. In particular this may be used for adaptive feedback on the relevance of delivered documents: there may be a table of UniqueIDs with a feedback score on each. Such a table might be used by another pre- or post-processor (not explicitly on the list described here).
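A sketch of the feedback table mentioned above, assuming it is simply a map from UniqueID to an accumulated relevance score; the class and method names are illustrative.
  import java.util.*;

  // Hypothetical sketch: relevance feedback keyed on each document's UniqueID.
  class FeedbackTable {
      private final Map<String, Double> feedback = new HashMap<>();

      void record(String uniqueId, double relevance) {        // e.g. +1 useful, -1 not useful
          feedback.merge(uniqueId, relevance, Double::sum);
      }

      double scoreFor(String uniqueId) {
          return feedback.getOrDefault(uniqueId, 0.0);
      }
  }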
Outputs:
The output will usually be an object on the user's screen.

Implementing the Context Diary

The operations that need to be performed on the Context Diary, as illustrated by the examples above, are:

  1. adding a context record, and deleting one (Example 1);
  2. loading the Diary and extracting the relevant diary from it (Examples 2 and 3);
  3. iterating through its contexts and matching their fields against a current context (Examples 2 and 3);
  4. saving the Diary, presumably in internal form (Example 1).

Interestingly these are almost identical to the requirements for manipulating the document collection. Hence I propose that, in the short term at least, the Context Diary takes an identical format to the document collection.

[[The alternative is to use a database to store the Context Diary (even though the document collection probably will not use a database). The main arguments for this are ease of use and efficiency. Even though the operations on the Context Diary may be similar to those on the document collection, their relative frequencies will be very different: for example, updating of the Diary is likely to be very frequent. A database is likely to be geared to working efficiently when there are frequent updates. A database may, however, have problems if the Diary records are highly non-uniform, with different records having different fields.]]

With this assumption, it may be easiest for the user to implement Example 1 as an instance of the Matcher, especially if all that is required is additions to the Diary rather than deletions. The Matcher will take as inputs the Context Diary (this will be supplied as if it were the document collection) and the current context to be added. The updating algorithm will be supplied as a plug-in to the Matcher. The plug-in will add a Time field to the current context, if not already present, in order to record the current time, and then add the augmented current context to the document collection (a fairly trivial operation irrespective of whether the document collection is a Java Vector or a linked list); finally it will call a library method to save the document collection (presumably in internal form). The Context Diary will be ordered by time, i.e. the value in the Time field -- this might be the internal Unix time of seconds since 1970. It is the responsibility of the user, not the library, to maintain this ordering. Thus when adding to the Context Diary, the user is responsible for inserting the new context in the right place, i.e. their plug-in should do this. (Incidentally this plug-in raises the need for the Matcher to provide a slot: "call this plug-in just before matching of fields starts".)
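A sketch of that plug-in, reusing the hypothetical DiaryEntry class from the earlier sketch; the field name "time" and the commented-out save call are assumptions about how the library might expose these facilities.
  import java.util.*;

  // Hypothetical sketch of the update plug-in: add a Time field if absent, insert the
  // context so that the diary stays ordered by time, then save.
  class DiaryUpdatePlugin {
      static void addToDiary(List<DiaryEntry> diary, DiaryEntry context) {
          if (!context.fields.containsKey("time")) {
              long nowSeconds = System.currentTimeMillis() / 1000;   // seconds since the Unix epoch
              context.fields.put("time", Long.toString(nowSeconds));
              context.time = nowSeconds;
          }
          // The user, not the library, is responsible for keeping the diary in time order.
          int i = 0;
          while (i < diary.size() && diary.get(i).time <= context.time) i++;
          diary.add(i, context);
          // saveDiary(diary);   // assumed library call to save the diary in internal form
      }
  }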