P.J. Brown and (Todo?)
Department of Computer Science, University of Exeter, Exeter EX4 4PT, UK
P.J.Brown@exeter.ac.uk
ABSTRACT
There is a plethora of approaches to retrieval. At one extreme is a web search engine, which provides the user with complete freedom to search a collection of perhaps over a billion documents. At the opposite extreme is a web page where the author has supplied a small number of links to outside documents, chosen in advance. Many practical retrieval needs lie between these two extremes. This paper aims to analyse the multi-dimensional spectrum of retrieval methods, and presents some hypothetical tools, intended as blueprints for the future, that aim to cover wider areas of the spectrum than current tools do.
Hall [Hall00] highlights two extremes in methods of retrieving information: (a) the traditional hypertext fixed link, which leads to a single document, and (b) the web search engine. Hall, together with several other researchers, has prophesied that the future lies between these extremes. This applies both on fixed terminals with large screens, and, even more, on personal devices with small screens and generally sparse computational and input/output resources. The purpose of this paper is to explore the spectrum of possibilities between these extremes; not surprisingly, it is a multi-dimensional spectrum.
We will start by giving some examples of applications that represent points on the spectrum, beginning with the most general forms of retrieval and moving towards more focussed retrieval.
These applications just represent points on a wide, multi-dimensional spectrum.
This paper covers a range of technologies for retrieval. These different technologies often use different terminology. For consistency we will choose, for this paper, one terminology: the terminology of Information Retrieval. Thus we have a query, which specifies the user's needs, and a document collection from which documents that match the query are retrieved. The retrieved documents are delivered to the user. We will take a wide view of what these terms cover: e.g. a document collection could be a set of documents in a human's mind, and the query could be a thought that the human has, e.g. `I need to provide some cross-references on topic X; which documents would be best for this?'.
Retrieval can be applied to material of any nature, but here we will assume that everything is textual, since this is still by far the most common sort of retrieval. The principles discussed here should, however, carry over to other types of retrieval.
We assume the end user is a human, who wants some information and hopes to acquire this information by reading one or more of the delivered documents. An alternative scenario, which we will not consider, is where the delivered documents are programs, and these are automatically run (an example would be where the query represented readings of sensors that recorded the office environment, and the delivered document was a program that, say, adjusted the air conditioning in order to improve the environment).
We use the word application as a generic term to cover all the retrieval systems we discuss: thus an IR system is one application, and WWW and Microcosm are examples of hypertext applications.
As a final piece of terminology, which we use when talking about the web, a web presentation is a set of integrated web pages, such as the set of pages describing a company's products.
Retrieval is a process of three stages:
Actually the three-stage model is a simplification of what may really happen in practice. In particular there may be iterations involving either the first two stages or all three. One of our introductory examples -- the composite retrieval application that involved a succession of filters -- illustrated iteration around the first two stages.
Iteration over all three stages might occur if a delivered document is not a predefined document but instead is a process for creating a document, like a CGI script on the web. This process itself could involve further retrieval, as would apply if the document delivered on one retrieval was a link to a search engine that then performed a further retrieval.
Each of the above three stages can be performed some time in advance of its successor, with the pre-calculated results being continually re-used by the successor stage. In the first stage either the query or the document collection, or both, can be specified in advance; an example of this is information filtering. Here the user specifies the query in advance and this query is continually re-used for retrieval, perhaps over a period of years, until the query is re-specified.
One model for the way a web page containing fixed links is created is that the retrieval of related information is done in advance (usually by a human author, but perhaps by automatic tools that help create web pages) and this is presented to the user as a set of links embedded in the page. The interactive user is only involved in the selection stage, where they pick the link to follow.
If retrieval is an iterative process, say of two iterations, then the first iteration can be done in advance -- typically creating a small cache of retrieved documents extracted from a large document collection -- and the second stage can be done on-the-fly.
The advantages of doing work in advance are many: speed of response is increased, repetitive work is avoided, work done in advance may be exploited by many users, data transmission charges may be reduced, and (with downloaded caches) problems with lack of connectivity can be surmounted. A prime example of the gains from doing work in advance is the web search engine: such engines can search a billion documents in an amazingly short time. Two keys to doing this are (a) collecting the documents in advance and (b) pre-processing the document collection into a surrogate form that facilitates fast retrieval. A web search engine that worked entirely on-the-fly would not be viable. In general, if an application deals with really large amounts of information, and if the application needs to deliver results in real-time, then it is imperative that some work be done in advance.
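As an illustration of point (b), the following Python sketch builds one possible surrogate form, an inverted index, once in advance and then re-uses it for every query. The names `build_index` and `search` and the three-document collection are hypothetical, chosen only for this example; real search engines use far more elaborate structures.

```python
from collections import defaultdict

def build_index(collection):
    """Done in advance: map each term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Done on-the-fly: return the ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical collection, for illustration only.
collection = {
    "d1": "traffic information for Exeter",
    "d2": "gardens open in Exeter",
    "d3": "traffic news",
}
index = build_index(collection)          # the work done in advance
print(sorted(search(index, "exeter traffic")))
```

The query consults only the index entries for its own terms, so its cost does not depend on scanning the whole collection. The price is the one noted in the text: the index goes stale if the collection changes after it was built.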
There are, however, disadvantages. The danger of doing work in advance is that the world may change and the work done in advance may be invalidated. The best known example of this is the dangling hypertext link: the author has retrieved a document in advance and provided a link to where the document was found, but when the user selects the link the document is no longer there. Doing all the stages on-the-fly greatly reduces this danger.
As the above examples indicate, the process of obtaining information has many dimensions, as represented by the following parameters:
Each dimension represents a spectrum of possibilities (though it is a narrow spectrum in the case of delivery if there are just the two possibilities: proactive and interactive).
A common factor among all of these dimensions is the choice between being free and being constrained: whether the user has complete control and responsibility, or whether the application takes some of this control and responsibility to reduce the burden on the user.
This paper discusses how existing tools cover parts of this multi-dimensional spectrum, how these tools could be generalised to cover more of the spectrum, and how new tools might be designed to cover areas of the spectrum that are not well covered at present.
Before discussing details, we will describe two general considerations: the use of context and the user interface.
In all the application areas we have described there is a continuing quest to make the delivered documents more relevant to the user. The general approach to tailoring information to a user is to collect information about that user or about other similar users who have the same needs. We call this the user's context. Context can potentially be wide-ranging [Sal99]. Sensors can detect the user's physical context: where they are, what direction they are going, what companions/equipment are nearby. The user's computing context is easy to capture: what document they are reading, what other activities they are pursuing, etc. (The document being read is of interest not only for its content, but also for any metadata or semantic information associated with it.) Wider aspects of context can be extracted from the web: the current weather and the weather forecast, share prices, traffic information, etc. All of these can have a bearing on what retrieved documents are currently relevant to the user. The user's past behaviour can be analysed, perhaps with some feedback from the user on what retrieved information was most relevant, and this analysis can be extended to cover other, similar, users, whose past behaviour may be a guide to the current user (as in peer-to-peer search engines). The context can also include user models and task models.
Overall the context consists of many different items -- we call them fields; at any one time some fields may be irrelevant to the retrieval process whereas others may be highly relevant. Thus we have a concept of a weight attached to each field, and these weights can change dynamically according to the user's needs. Weights are typically set by automatic tools, since it would be a big burden on the user to continually set and update them.
The main advantage of context is that, if it is collected automatically and then used effectively, it can bring big improvements to the relevance of the documents delivered, all without any extra user effort. There can, however, be further gains if the user does make an extra effort, by telling the application something about their nature and needs, e.g. whether they are a beginner or expert, and what topics they are currently interested in. This can be done as a once-off activity, in advance of the retrieval requests.
For all types of context it is useful if the application maintains a history, e.g. a trail of locations visited, or documents previously viewed. This `history' may encompass the future too, as derived, for example, from future diary entries. An example is that the user's diary might say they should be at a certain location in two hours' time; this element of context might have an effect on the user's current information needs. Context can be further enriched by guessing higher level contextual states from the values of low-level sensors. For example Pepys [New91] used active badge information to detect whether a meeting was taking place (several people converging to the same place at roughly the same time, and then staying there). Being in a meeting is an important factor of a user's context, and should affect which documents are delivered to them, and, indeed, the user interface for delivery and selection.
Overall, therefore, the context can represent a rich resource for helping to find what information the user is interested in. In our introductory set of examples, the one labelled `context-aware retrieval' was the most manifest case of this, but context could be valuable in many of the other examples too. Exploiting physical context is probably easiest, e.g. the location of a mobile user. Matching of locations is easy (though not trivial, for instance when the user is constrained by streets) and reliable in the sense that all the documents whose associated location is close should be delivered, and all those that are far away will not. At the other end of the scale, looking at the contextual history of a user's past retrievals, and making this influence the current retrieval operation is hard and potentially unreliable -- occasionally it will lead to irrelevant documents being delivered or relevant documents missed. However most web browsers provide a small step in this direction: they keep a history of links followed, and, each time a link is displayed, display it in a different colour if the destination of the link has already been visited. The web browser does not make any judgement on the relevance of this information: it just tells the user and lets them judge. It is also relatively easy for a web browser to test if the information at the link destination has changed since the user last viewed it, and alert the user if this is so.
Frequently it is the current context that is most relevant, and this context may be changing quickly. Generally the user's current context cannot be known in advance, and hence context is best used in retrieval processes that are on-the-fly.
There is, however, one case that is exceptional: as we have said, in a hypertext system, the author, when embedding links in a page, knows that the user will be reading that page when they select the links -- thus the author knows in advance the document (i.e. web page) being read. This is, of course, part of the user's context. If the page is part of a hierarchy or a sequence of pages, it may be further deduced how the user reached the current page, and what documents they saw on the way; in addition the overall nature of the web presentation to which the page belongs may be a guide to the nature of the user (e.g. the presentation may be `Mathematics for the conceptually challenged'). Such contextual information is routinely used by hypertext authors (`if the user is reading this page, they are assumed already to know about X'). This case is exceptional for a second reason: the context applies to all users, whereas generally a contextual field is tied to an individual user.
The success or failure of an application is often determined by how crisp its user interface is. In this paper we are concerned only with the user interface during the retrieval process. We are not concerned with the user interface after the retrieval process, and the display of the retrieved documents. (Thus we are not concerned with highlighting relevant terms within a document, or linking to a point or region within a document, rather than to its start.)
During the retrieval process, if all or nearly all user input is done by pointing a mouse or a finger at the screen, rather than by typing, then this can be a big advantage, especially in mobile applications. In particular this applies in cases where the user creates the query: if the user is viewing one document, and can construct the next query by selecting words or phrases within the current document that are of especial interest, then, although this might be worse in retrieval terms than typing in a new query, the convenience of the user interface might more than compensate for this loss.
Having discussed some of the background issues, we will now look more closely at the components of the three-stage retrieval process.
A fundamental issue is the structure of the information to be retrieved. In information retrieval the data is typically unstructured or semi-structured. If information is fully structured, in the sense that each document is divided into exactly the same fields, then database technology is typically used. Although not totally structured, IR document collections are, however, often divided into fields, and the query may refer to individual fields, e.g. `Match XXX in the Author field and YYY in the Title field', or, with a context-aware application, `Match XXX in the Location field and YYY in the Time field'. As we observed when discussing context, different fields can have different weights, e.g. that Location is twice as important as Time. Moreover these weights may change dynamically, e.g. in a context-aware application for tourists, the field whose value has changed most since the last query may get highest weight.
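The idea of weighted fields can be made concrete with a small Python sketch. It assumes the simplest possible matching rule (a field either matches exactly or not) and a hypothetical `score` function; neither is prescribed by the discussion above, which leaves the matching method open.

```python
def score(query, document, weights):
    """Weighted fraction of the query's fields that the document matches exactly.

    Fields without an explicit weight default to 1.0 (an assumption of this sketch).
    """
    total = sum(weights.get(field, 1.0) for field in query)
    if total == 0:
        return 0.0
    matched = sum(
        weights.get(field, 1.0)
        for field, wanted in query.items()
        if document.get(field) == wanted
    )
    return matched / total

query = {"Location": "Exeter", "Time": "morning"}
weights = {"Location": 2.0, "Time": 1.0}   # Location is twice as important as Time

doc_a = {"Location": "Exeter", "Time": "evening"}    # matches Location only
doc_b = {"Location": "Plymouth", "Time": "morning"}  # matches Time only

print(score(query, doc_a, weights) > score(query, doc_b, weights))
```

Because the weights are passed in separately from the query, an application could replace the `weights` dictionary between one retrieval and the next, which is one way to model weights that change dynamically.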
The design and issuing of a query (together with other aspects such as specification of weightings and of the document collection to be used) can be done by the user, or can be done wholly or partly by an assistant. The assistant can be a human or a program. In the human case, the assistant's work can be done in advance, as with an expert author who has designed how readers will retrieve information, or can be a person who interacts directly with the user -- indeed they could be sitting side by side. The assistant may design the query and/or may decide when to issue the query.
The user may want the assistant to specify the query because:
Often the specification of the query is done jointly by the user and the assistant: the user first specifies the query, but, behind the scenes, the assistant enhances it to factor in additional considerations. (Actually, if we think in implementation terms, we may be over-simplifying how the assistant works. In the implementation it may be more convenient to represent assistants as components in a pipeline of retrieval operations, rather than as contributing to a single monolithic query.)
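The pipeline view mentioned in the parenthesis can be sketched in Python as follows. Each stage is a retrieval operation that narrows a list of documents; the first stage comes from the user's own query and the second from a hypothetical assistant that knows the user's location. All names and the sample documents are assumptions made for this illustration.

```python
def match_terms(terms):
    """Stage built from the user's query: keep documents containing every term."""
    wanted = {t.lower() for t in terms}
    def stage(documents):
        return [d for d in documents
                if wanted <= set(d["text"].lower().split())]
    return stage

def near(location):
    """Stage contributed by a hypothetical assistant: keep one location only."""
    def stage(documents):
        return [d for d in documents if d.get("location") == location]
    return stage

def run_pipeline(documents, stages):
    """Apply each retrieval operation in turn to the output of the previous one."""
    for stage in stages:
        documents = stage(documents)
    return documents

# Hypothetical document collection, for illustration only.
documents = [
    {"id": "d1", "text": "Conifers in the arboretum", "location": "Exeter"},
    {"id": "d2", "text": "Conifers for small gardens", "location": "Plymouth"},
    {"id": "d3", "text": "Traffic news", "location": "Exeter"},
]
result = run_pipeline(documents, [match_terms(["conifers"]), near("Exeter")])
print([d["id"] for d in result])
```

Keeping the assistant as a separate stage means it can be added, removed or reordered without rewriting the user's query.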
The document collection may be fixed by an application. Alternatively it can be specified by the user or by an assistant, e.g. an intelligent resource-discovery agent that finds the most appropriate document collection on-the-fly. An example of such an assistant would be an agent that found a document collection that gave traffic information about locations that the user was close to or heading towards.
In any push technology, it is not the user who issues the query; instead the retrieval engine does it. In some areas, such as Information Filtering (IF), the query is still designed by the user. Here the reason that control is taken from the user is that they would not know when to issue the query (e.g. in the IF case they do not generally know when each new document arrives), and, even if they did, it would be tedious to continually issue the same query.
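A minimal Python sketch of the IF arrangement: the user designs the query once, in advance, and the application (not the user) issues it each time a document arrives. The class name `StandingQuery`, its method `on_arrival` and the all-terms-present matching rule are assumptions of this sketch.

```python
class StandingQuery:
    """A query specified once in advance and re-used for every new document."""

    def __init__(self, terms):
        self.terms = {t.lower() for t in terms}
        self.delivered = []

    def on_arrival(self, doc_id, text):
        """Called by the application whenever a new document arrives."""
        words = set(text.lower().split())
        if self.terms <= words:          # every query term occurs in the document
            self.delivered.append(doc_id)
            return True
        return False

# The user designs the query in advance ...
standing = StandingQuery(["hypertext", "retrieval"])

# ... and the application issues it on each arrival (hypothetical documents).
standing.on_arrival("n1", "New work on hypertext retrieval")
standing.on_arrival("n2", "Weather forecast for Devon")
print(standing.delivered)
```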
Proactive context-aware retrieval (CAR) systems are similar to IF systems, but the queries are associated with each document; they are prepared in advance by an author, not by the user. For example a document associated with a garden might have the associated query `is the user's location near the garden, and does the time correspond to the garden's opening hours'. The query attached to a document can be regarded as a form of metadata; indeed the query need not be explicit, but could be automatically derived from metadata attached to the document. For example the document associated with the garden might have some `requirements' metadata that has two fields: a location and a time. In addition the nature of the dynamic elements is different from IF. In proactive CAR the document collection may well be completely static (whereas it is dynamic in IF): the dynamic element is the user's current context, against which the queries attached to each document are matched. Typically CAR is automatically performed whenever there has been some significant change in the user's context (the context typically includes time, so one criterion for a new retrieval can be that time has advanced by a certain amount).
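The garden example can be sketched in Python as below. The document carries `requirements` metadata with a location and opening hours, and the implicit query is derived from that metadata and matched against the user's current context. Planar coordinates in metres, whole-hour times and the 500 m radius are simplifying assumptions of this sketch, not part of the description above.

```python
import math

def matches(requirements, context, radius_m=500.0):
    """Derive the implicit query from the metadata and test it against the context."""
    dx = requirements["location"][0] - context["location"][0]
    dy = requirements["location"][1] - context["location"][1]
    is_near = math.hypot(dx, dy) <= radius_m
    opens, closes = requirements["open_hours"]
    is_open = opens <= context["hour"] < closes
    return is_near and is_open

def retrieve(documents, context):
    """Run on each significant change of context; the collection itself is static."""
    return [d["title"] for d in documents if matches(d["requirements"], context)]

# Hypothetical static collection: one document about a garden.
documents = [
    {"title": "Garden guide",
     "requirements": {"location": (1000.0, 2000.0), "open_hours": (10, 17)}},
]

print(retrieve(documents, {"location": (1200.0, 2100.0), "hour": 11}))
print(retrieve(documents, {"location": (1200.0, 2100.0), "hour": 18}))
```

Only the context argument changes between the two calls, which mirrors the point that in proactive CAR the dynamic element is the user's context rather than the document collection.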
There are CAR systems that are interactive rather than proactive, but their operation is basically similar.
Overall the common factor of all these cases is that change (a new document, a change in the user's context) causes retrieval to occur, and the assistant may know much more about change than the user.
The table shows the properties of some existing applications. Clearly, since systems vary, the table can only give an overall impression rather than a definitive statement for every existing system. Within the table we use the suffix `/Adv' to mean `is (or may be) done in advance'. The `User' means the end-user. The table row labelled `Generic link' describes the facility first offered by Microcosm [Dav92, Hall96].
Application | Query | Document collection | Push/Pull
---|---|---|---
IR | User | User or (Application/Adv) | Pull
IF | User/Adv | (User or Application)/Adv | Push
CAR | Assistant | User or (Application/Adv) | Push or Pull
WWW link | Human/Adv | Human/Adv | Normally Pull
Generic link | User | Human/Adv | Pull
Autonomous agents | User/Adv | Assistant | Push or Pull
There is usually a degree of uncertainty about whether the delivered documents will really be of interest to the human end-user -- if there were no uncertainty the selection stage would be irrelevant, as the retrieved documents could be delivered directly to the user. The selection stage should give the user as much help as possible in resolving this uncertainty. There are two standard ways of doing this, and both can be used together:
Overall the ranked list and the labels provide an opportunity for the application to explain to the user why each delivered document may be relevant to them. Obviously this is easiest in a hypertext system, where the documents are known in advance and a human author writes the labels and provides any ranking (`Here are six papers, in order of increasing complexity, that explain more'). However there are also opportunities for automatic systems to provide further information to users (`this garden is very close to your current location, is open, and matches your interest in conifers; you have not apparently visited any gardens on your current trip'); such opportunities are not widely exploited at present, but, we believe, could be an important part of the success of an overall system.
Some hypertext systems support one-to-many or even many-to-many links, rather than the traditional one-to-one link. In terms of the model presented here these are not fundamentally different from one-to-one links: they just offer a richer user interface for selection.
To return to the two extremes presented at the start of this paper, an IR system offers the ultimate in freedom: the user has control over the complete retrieval process. A web page, on the other hand, is in retrieval terms a highly constrained system: the author has done all the retrieval work in advance, and the only choice the user has is to select one of the links provided. If a constrained system suits the user's needs, this is ideal: the user has been saved a lot of work by the author who successfully constrained retrieval to cover just what the user needed.
Unfortunately, however, no application will meet the needs of all users in all situations. In particular sometimes the user will want less constraint, and sometimes the user will want more help from the application, often in the form of constraining a large number of possibilities into a smaller number, better tailored to the user. Thus many applications provide extra mechanisms, either to remove constraints or to impose them.
One approach to removing constraint is as follows: the user is viewing a web page, and wants other information, which is not covered by the links provided. The user then selects, from within the web page, one or more words (or passages) that are of especial interest, and hits a button called `Retrieve' or the like. The relevant documents are then retrieved and delivered. In some applications the document collection from which retrieval takes place may be a limited one, but especially tailored to the material in the web presentation being accessed. For example the document collection may take the form of a dictionary: if the user selects a word in the current web page, and if that word is in the dictionary, the dictionary entry is displayed. The dictionary need not be a comprehensive one: it could just be a glossary of special terms used in the web presentation, or of topics for which there is further information (like generic links in Microcosm).
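The dictionary case is simple enough to sketch directly in Python. The glossary entries and the function name `retrieve_selection` are hypothetical; the sketch only normalises case and spacing before looking the selection up.

```python
# A small glossary tailored to one web presentation (hypothetical entries).
glossary = {
    "query": "A specification of the user's needs.",
    "document collection": "The set of documents from which retrieval takes place.",
}

def retrieve_selection(selection, glossary):
    """Return the entry for the text the user selected, or None if there is none."""
    key = " ".join(selection.lower().split())   # normalise case and spacing
    return glossary.get(key)

print(retrieve_selection("Document  Collection", glossary))
print(retrieve_selection("spectrum", glossary))
```

A selection that is not in the glossary yields `None`, at which point an application could fall back to a wider document collection.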
The above process, based on finding documents related (a) to the document the user is currently reading and (b) to an individual user's context, can be automated. There are many systems, some commercial products, that do this. One class of these is the Just-in-time Information Retrieval agents of Rhodes and Maes [Rho00]. These create a retrieval query based on what the user is currently reading and/or writing, and on the user's context (which includes past history). This query leads to the retrieval of some documents, hopefully highly relevant to the user's current activity, and these documents are presented discreetly (and discretely!) to the user.
Direct hypertext links as found in WWW represent only one of a large number of possible types of link. Of course, WWW offers other types of link, such as links to CGI scripts, but these still represent a subset of the possibilities. A much wider classification of links is provided by DeRose [DeR89]. In this classification there are two overall types of link: (a) an extensional link, where the link is essentially an ad hoc connection to one or more possible documents, and (b) an intensional link, where the link is derived from executing a function. One example of such a function is a CGI-script, which in effect creates a new document on-the-fly and links to it. Another is what DeRose calls a retrieval link. A retrieval link creates, on-the-fly, a link to some existing document (which may come from some restricted collection of documents or the whole `docuverse'). DeRose's classification represents a theoretical description, rather than a taxonomy of existing systems. In principle, however, the function that drives an intensional link can do anything, and can cover all the possible functionality we have described here. Hence it could provide any degree of freedom up to full IR.
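The distinction between the two overall types of link can be sketched in Python as follows. The class names, the `resolve` method and the toy retrieval function are assumptions made for this illustration and are not taken from DeRose's paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ExtensionalLink:
    """An ad hoc connection to one or more documents, fixed by the author."""
    targets: List[str]

    def resolve(self, context):
        return list(self.targets)

@dataclass
class IntensionalLink:
    """A link whose destinations are derived by executing a function."""
    function: Callable[[dict], List[str]]

    def resolve(self, context):
        return self.function(context)

# Hypothetical restricted collection used by a retrieval link.
collection = {"d1": "conifers in Devon", "d2": "traffic in Devon"}

def retrieval_function(context):
    """Find, on-the-fly, the existing documents that contain the context's term."""
    return sorted(doc_id for doc_id, text in collection.items()
                  if context["term"] in text.split())

fixed = ExtensionalLink(["d1"])
dynamic = IntensionalLink(retrieval_function)

print(fixed.resolve({"term": "traffic"}))     # always the author's choice
print(dynamic.resolve({"term": "traffic"}))   # computed when the link is followed
```

Since `function` can be any callable, the intensional link in this sketch can range from a fixed lookup up to a full search, which is the sense in which it can offer any degree of freedom.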
Up to now we have spoken in terms of a single context for the user, but this is simplistic. Most of us wear several hats; we may be a researcher in X and Y, a teacher in X and Z, an administrator, a hobbyist, a traveller, ... . Thus we have several possible contexts, and continually switch between them. A context typically consists of an aggregate of several components, which we have called `fields' -- to parallel the fields within documents. Different contexts may share some contextual fields. For example the current time will be the same for nearly all of them: however an exception to this would occur if the traveller set their time to a pretended value, representing a future time at which they plan to travel. To meet the need for multiplicity, we assume that our application maintains not one context, but several.
In this final part of the paper, we postulate some applications that cover wide parts of one or more dimensions of the spectrum. In some cases existing systems already come close to these postulated applications. Our postulated applications represent combinations of existing applications, e.g. C + D + a dash of E and F. In such cases we will somewhat arbitrarily take one component as a starting point, C say, and build from there. Before describing the applications, we will describe a piece of infrastructure that the applications need:
We believe that the read/write interface is a simple and powerful aid to allow the user to influence the retrieval process. We now postulate the applications. They exploit a read/write interface; they also cater for multiple contexts.
Our first example starts from IF, and our postulated application is called SUPERIF. As with normal IF the query is supplied in advance. Instead of working from a pre-defined document collection, SUPERIF uses a resource-discovery agent to find documents that meet the user's needs, as given by the query that has been supplied in advance. When a user first employs SUPERIF, they may optionally choose to set an initial query; in any case whatever query the user supplies is automatically supplemented by SUPERIF. This is done by a process of deduction from looking at each user's retrieval behaviour, and evolving the query continually (e.g. daily or weekly). Even in our make-believe world, however, it would be unrealistic to expect this process of query deduction and evolution to work well all the time. Thus the user will sometimes want to intervene, and to modify the query automatically constructed for them; hence the application must be able to present the query to the user in a comprehensible and easily changeable form.
In line with our earlier comments about multiple contexts, SUPERIF creates several queries for each user, one for each hat they wear. SUPERIF is radical in the way it delivers retrieved documents. It only delivers documents when the user is detected as being in a context where `they have the right hat on'. Thus if a query related to research papers, this would be delivered when the user next read or wrote a document that related to that activity. (There might be some special process for urgent documents: e.g. delivery with any hat on.) SUPERIF has proactive delivery of information. It presents the set of delivered documents in a read/write interface, so that the user can make annotations or changes to help selection or subsequent retrieval.
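Since SUPERIF is hypothetical, the following Python sketch is only one way its hat-based delivery might behave: retrieved documents are held until the user is detected as wearing the matching hat, with urgent documents delivered under any hat. The class name `HatDelivery` and its methods are invented for this illustration.

```python
class HatDelivery:
    """Hold retrieved documents until the user has the right hat on."""

    def __init__(self):
        self.pending = []   # list of (hat, doc_id, urgent) tuples

    def hold(self, hat, doc_id, urgent=False):
        """Record a document retrieved by the query belonging to one hat."""
        self.pending.append((hat, doc_id, urgent))

    def deliver(self, current_hat):
        """Return, and stop holding, the documents suited to the current hat."""
        ready = [doc for hat, doc, urgent in self.pending
                 if urgent or hat == current_hat]
        self.pending = [(hat, doc, urgent) for hat, doc, urgent in self.pending
                        if not (urgent or hat == current_hat)]
        return ready

delivery = HatDelivery()
delivery.hold("researcher", "paper1")
delivery.hold("hobbyist", "garden1")
delivery.hold("teacher", "notice1", urgent=True)

first = delivery.deliver("hobbyist")      # hobbyist documents plus anything urgent
second = delivery.deliver("researcher")   # the research paper, held until now
print(first, second)
```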
SUPERIF may cater for short-term or long-term retrieval needs. Movement on the spectrum between short and long term can be achieved by adjusting weightings in the use of context. Thus for short-term needs the present context has a higher weighting than history. For long-term needs the present is less important.
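The movement between short-term and long-term needs can be illustrated with a single weighting parameter. In this Python sketch the present context and the history are both reduced to sets of terms and combined linearly; the `overlap` measure, the `present_weight` parameter and the sample terms are all assumptions made for the example.

```python
def overlap(doc_terms, context_terms):
    """Fraction of the context's terms that the document also contains."""
    if not context_terms:
        return 0.0
    return len(doc_terms & context_terms) / len(context_terms)

def score(doc_terms, present, history, present_weight):
    """Blend present context and history; present_weight lies between 0 and 1."""
    return (present_weight * overlap(doc_terms, present)
            + (1.0 - present_weight) * overlap(doc_terms, history))

present = {"traffic", "exeter"}          # what the user is doing right now
history = {"hypertext", "retrieval"}     # what the user has done over time

news = {"traffic", "exeter", "delays"}
paper = {"hypertext", "retrieval", "links"}

# Short-term need: the present context dominates.
print(score(news, present, history, 0.9) > score(paper, present, history, 0.9))
# Long-term need: the history dominates.
print(score(paper, present, history, 0.2) > score(news, present, history, 0.2))
```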
Our second application, SUPERIR, starts from IR as a base. SUPERIR only does context-aware retrieval. Since part of the context is the current document of interest, the user can simulate a current search engine by creating a new document and just typing some search terms into it. (At an extreme they could ask for all other aspects of their context to be shut out, thus relying solely on the search terms they have just typed.) In many cases they will start by inputting an existing document that comes close to their needs, and will add annotations to it to capture their needs better. This, together with the rest of the user's context, will be used as a basis for retrieval. SUPERIR, like SUPERIF, caters for multiple contexts. SUPERIR is geared to the user's short-term needs, and cases where the query is not supplied in advance but interactively on-the-fly.
In designing our two hypothetical applications, we could have used various criteria for distinguishing them, e.g.: (a) one for short-term needs and one for long-term, or (b) one for proactive delivery and one for interactive. We do not, however, believe that either of these is fundamental. Instead the criterion we have used is whether queries are specified in advance, or whether they are supplied interactively on-the-fly. In a sense this criterion is not fundamental either; however we believe it is fundamental to having any chance of an efficient and hence usable application that caters for a large number of documents. Thus we believe that an implementation-based criterion must, at least in the foreseeable future, take precedence over others. The criterion is part of the difference between IR and IF, and the whole key to each of these has been to provide optimisations based on the parts (the queries or the document collection) that are known in advance. To be practical we think therefore that SUPERIR, which has dynamic queries, would have to know the document collections in advance. Obviously, however, there will be specialised applications -- e.g. a traffic information system where all documents were indexed by location -- where high performance can be achieved in a totally dynamic world.
At the start of this paper we quoted the view -- a view widely held -- that, in many real situations, following hyperlinks is too restrictive and using a general IR search is too permissive. The user wants something in between the extremes, but there is an added, sometimes implicit, requirement that the something must require no more effort than the extremes, and ideally should require less effort. A key to achieving this is automatic processes that select or enhance the query, choose the document collection, and perhaps proactively deliver documents. The most natural way to accomplish this is to collect and exploit information about the user's context, and to use this in the automatic process. A further aid, affecting both convenience and performance, is to perform some stages of the retrieval process in advance.
The final retrieval stage often requires the user to select from a number of delivered documents. Tools to aid retrieval should also present information to help the user in this selection process, i.e. to explain why the application thinks a document is relevant to the user. The more the retrieval process is automatic, the more this explanation is necessary.
Currently we have a plethora of retrieval tools, each representing a point on a spectrum. The future surely lies in making these tools work together in a seamless way, and making each tool cover a wider part of the spectrum. We have proposed two tools, SUPERIR and SUPERIF, to this end. Each knows some information in advance, and this offers a chance for a practical and efficient implementation that caters for document collections of a realistic size.