P.J. Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
e-mail: P.J.Brown@exeter.ac.uk
ABSTRACT
There is a plethora of approaches to retrieval. At one extreme is a web search engine, which provides the user with complete freedom to search a collection of perhaps over a billion documents. At the opposite extreme is a web page where the author has supplied a small number of links to outside documents, chosen in advance by the author. Many practical retrieval needs lie between these two extremes. This paper aims to look at the multi-dimensional spectrum of retrieval methods; at the end, it presents some hypothetical tools, hopefully blueprints for the future, that aim to cover wider parts of the spectrum than current tools do.
Hall (1) highlights two extremes in methods of retrieving information: (a) the traditional hypertext fixed link, which leads to a single document, and (b) the web search engine. Hall, together with several other researchers, has prophesied that the future lies between these extremes. This applies both on fixed terminals with large screens and, even more, on personal devices: since these have small screens and generally sparse computational and input/output resources, it is imperative that any retrieval approach be optimised to the user's real needs. The purpose of this paper is to explore the spectrum of possibilities between the two retrieval extremes; not surprisingly, it is a multi-dimensional spectrum. In essence each dimension of the spectrum goes from complete freedom (the user is on their own) to high levels of constraint (decisions have been made on behalf of the user by a human -- such as the author of a hypertext page -- and/or a computer program).
We will start by giving some examples of applications that represent points on the spectrum, beginning with the most general forms of retrieval and moving towards more focussed retrieval.
These applications just represent points on a wide, multi-dimensional spectrum; we discuss these dimensions below.
The set of examples presented above tried to highlight the differences between applications by showing `pure' examples of each. There are, however, several examples of applications that combine the properties of two or more of the pure examples. In particular there are numerous applications that bring together browsing and searching. We will mention just two of them here in order to give a flavour: each encompasses a well-designed combination of the two facilities, rather than a simplistic cobbling together. Firstly, ScentTrails (4) is a browser where the user supplies a set of search terms indicating their current interests. The user might update these terms as the search proceeds. When the browser presents a hypertext page, it looks at the links within the page, and gives them different weightings according to their apparent relevance to the user's search terms. The more highly weighted a link the more it is highlighted, e.g. by using an increasingly large font. ScentTrails uses a sophisticated algorithm to calculate weights, taking account of the linking structure and of the occurrences of search terms within each page.
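To give a concrete flavour of the kind of weighting ScentTrails performs, the following sketch scores each link by the overlap between the user's search terms and the text describing the link, and maps the weight to a font size for highlighting. The data structures and the simple overlap score are invented for the example; ScentTrails itself uses a more sophisticated spreading-activation algorithm over the linking structure.

```python
def weight_links(links, search_terms):
    """Score each link (anchor text -> short description of the target
    page) by the fraction of the user's search terms that appear in the
    anchor or description.  A deliberately simple stand-in for
    ScentTrails' spreading-activation weighting."""
    terms = {t.lower() for t in search_terms}
    weights = {}
    for anchor, description in links.items():
        words = {w.lower() for w in (anchor + " " + description).split()}
        weights[anchor] = len(terms & words) / len(terms)
    return weights

def font_size(weight, base=10, extra=8):
    """Map a 0..1 weight to a display font size: the more highly
    weighted a link, the larger the font used to highlight it."""
    return base + round(extra * weight)
```

A link whose description mentions all of the user's current search terms would thus be rendered noticeably larger than an apparently irrelevant one.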
A second interesting hybrid application is that of Cunliffe, Taylor and Tudhope (5), and is designed for museums. The application has a supporting data structure that contains measures of the semantic closeness of items. The application supports both browsing and search queries, and the semantic data structure is used to help integrate the two, and to suggest alternative approaches when the user comes to a dead end on one approach.
The discussion in this paper concentrates on existing and potential applications. A more fundamental approach is to look at human information-seeking behaviour in general. A comprehensive treatment of this can be found in the book by Marchionini (6). Marchionini's analysis covers not only computer applications but cases where the information provider is a human, e.g. asking a colleague rather than performing a search using a computer. Marchionini separates out two different strategies in information seeking: analytical strategies and browse strategies. The former, of which IR is an example, are planned, goal-driven, deterministic, formal and consisting of discrete steps; in contrast the latter tend to be opportunistic, data-driven (e.g. the data in one hypertext page provides links to the next), heuristic, informal, and consisting of a continuous sequence of steps. Obviously there is scope, in most information-seeking tasks, to use a mixture of browsing and analytical strategies. There is also scope for finding new approaches that come mid-way between analytical strategies and browse strategies.
An earlier paper, pre-dating the web, presented human information-seeking needs in a three-dimensional model (7), where the dimensions are structural responsibility, target orientation and interaction method; the paper gives eight exploration paradigms that represent the vertices of the model.
Different retrieval technologies often use different terminology. For consistency we will here choose one terminology: the terminology of Information Retrieval. Thus we have a query, which specifies the user's needs, and a document collection from which documents that match the query are retrieved. The retrieved documents are delivered to the user. Following Marchionini's example, we will take a wide view of what these terms cover: e.g. a document collection could be a set of documents in a human's mind, and the query could be a thought that the human has, e.g. `I need to provide some cross-references on topic X; which documents would be best for this?'.
Retrieval can be applied to any nature of material, but here we will assume that everything is textual, since this is still by far the most common sort of retrieval. The principles discussed here should, however, carry over to other media. We assume the end user is a human, who wants some information and hopes to acquire this information by reading one or more of the delivered documents.
We use the word application as a generic term to cover all the retrieval systems we discuss: thus an IR system is one application, and WWW and Microcosm are examples of hypertext applications.
As a final piece of terminology, which we use when talking about the web, a web presentation is a set of integrated web pages, such as the set of pages describing a company's products.
Having defined the terminology, we now present a general model of retrieval. In this model, retrieval is a process of three stages: (1) specification, in which the query and the document collection to be used are specified; (2) retrieval proper, in which the query is issued and the documents matching it are found and delivered; and (3) selection, in which the user selects, from the delivered documents, the ones to read.
Actually the three-stage model is a simplification of what may really happen in practice. In particular there may be iterations involving either the first two stages or all three. One of our introductory examples -- the composite retrieval application that involved a succession of filters -- illustrated iteration around the first two stages.
Iteration over all three stages might occur if a delivered document is not a predefined document but instead is a process for creating a document, like a CGI script on the web. This process itself could involve further retrieval, as would apply if the document delivered on one retrieval was a link to a search engine that then performed a further retrieval.
Iteration over the three stages also occurs if the user is not satisfied with the documents delivered: the user may reformulate the query and/or provide relevance feedback, and as a result get a new set of documents.
After a document has been selected, there can be a further step before the document is presented to the user: the identification of fragments that might be of special interest to the user. These fragments could be occurrences of search terms within the document -- these might be highlighted when the document is displayed; alternatively a fragment could be identified by a suffix attached to a URL (thus causing the browser to jump to a certain point in the document), or, in an XML document, by a specification written in the XPointer language (9) that dynamically searches the document's structure to find the relevant fragment.
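The first of these possibilities -- highlighting occurrences of search terms when the document is displayed -- can be sketched as follows; the highlight markers here are plain-text stand-ins for whatever the display medium provides (e.g. an HTML mark-up element).

```python
import re

def highlight_terms(document, terms, mark=("**", "**")):
    """Return the document text with each whole-word, case-insensitive
    occurrence of a search term wrapped in highlight markers."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: mark[0] + m.group(1) + mark[1], document)
```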
ScentTrails, as mentioned above, represents a specific example of an application that highlights relevant fragments: in its case the highlighted fragments are the links that appear to be most relevant.
An interesting step beyond identification of fragments is dynamic generation of the retrieved document using the fragments that are most likely to interest the current user, an instance of Adaptive Hypertext, e.g. (10). Such facilities can be found, for example, in applications such as Intelligent Labelling Explorer (11) and HyperAudio (12), designed to retrieve information for visitors to museums. A further example, in an interestingly different area, is the creation of personalised hypertext fiction using story fragments (13). In all these cases the choice of fragments used to construct a dynamic document can take account of the user's current context (their location, orientation, interests, preferences) and past contexts (e.g. documents previously viewed and the amount of time spent looking at them).
Another example, again entirely different, arises when a document is generated that reports on the retrieval process itself, taking advantage of information gleaned from the structure of the document collection. An example is OpCit (14). OpCit's document collection is a large set of research papers: when one paper cites another, this is treated as a hypertext link. Some users may want to retrieve individual papers, but others might want to know how many papers cite a given one, whether these citations are from leaders in the field or whether these citations just come from the authors' colleagues. OpCit can generate a document that answers such questions.
Fragment identification and dynamic generation of documents can be used in any retrieval process, and indeed offer a promising avenue for further advances. They are, however, separate from retrieval itself, and in the rest of this paper we will discuss retrieval in terms of whole documents retrieved.
Each of the above three stages of the retrieval process can be performed some time in advance of its successor, with the pre-calculated results being continually re-used by the successor stage. In the first stage either the query or the document collection, or both, can be specified in advance. An example is information filtering: here the user specifies the query in advance and this query is continually re-used for retrieval, perhaps over a period of years, until the query is re-specified.
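The information-filtering case can be sketched as a standing query object: the query is specified once, in advance, and then re-used against each newly arriving document. The class and its term-overlap match are illustrative, not a description of any particular IF system.

```python
class InformationFilter:
    """A standing query, specified once by the user and then matched
    against each newly arriving document; only matches are delivered."""

    def __init__(self, query_terms):
        self.terms = {t.lower() for t in query_terms}

    def matches(self, document):
        words = {w.lower().strip(".,;") for w in document.split()}
        return bool(self.terms & words)

    def filter_stream(self, documents):
        """Apply the same pre-specified query to a stream of new
        documents -- the re-use that would be tedious to do by hand."""
        return [d for d in documents if self.matches(d)]
```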
One way of thinking about the way web pages, with their fixed links, are created is as follows: the first two stages of the three-stage process are done in advance (usually by a human author, but perhaps by automatic tools that help create web pages) and this is presented to the user as a set of links embedded in the page. The interactive user is only involved in the selection stage, where they pick the link to follow. Thus, as far as the hypertext user is concerned, it is a `Stage 3 only' process.
If retrieval is an iterative process, say of two iterations, then the first iteration can be done in advance -- typically creating a small cache of retrieved documents extracted from a large document collection -- and the second stage can be done on-the-fly. This is particularly useful for context-aware retrieval on small devices, where the domain of retrieval, i.e. the contexts to be covered, can be forecast in advance.
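This two-iteration arrangement can be sketched as a pair of filters: a coarse one run in advance to build the cache, and a fine one run on-the-fly against the cache. The document representation and the example predicates are assumptions made for illustration.

```python
def precompute_cache(collection, coarse_query):
    """First iteration, done in advance: extract from a large
    collection the documents covering the forecast domain of retrieval
    (e.g. everything about the region a tourist will visit)."""
    return [d for d in collection if coarse_query(d)]

def retrieve(cache, fine_query):
    """Second iteration, done on-the-fly on the small cache, e.g.
    against the user's current context on a small device."""
    return [d for d in cache if fine_query(d)]
```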
The advantages of doing work in advance are many: speed of response is increased, repetitive work is avoided, work may be shared across many users, data transmission charges may be reduced, and (with downloaded caches) problems with lack of connectivity can be surmounted. A prime example of the gains from doing work in advance are web search engines: these can search a billion documents in an impressively short time; two keys to doing this are (a) collecting the documents in advance and (b) pre-processing the document collection into a surrogate form that facilitates fast retrieval. A web search engine that worked entirely on-the-fly would not be a runner. In general, if an application deals with really large amounts of information, and if the application needs to deliver results in real-time, then it is imperative that some work be done in advance.
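The surrogate form mentioned in (b) is classically an inverted index, built in advance so that retrieval intersects small posting sets instead of scanning every document. The following is a minimal sketch of that idea, far simpler than a real search engine's index:

```python
from collections import defaultdict

def build_index(collection):
    """Pre-process the collection (done in advance) into a surrogate
    form: an inverted index mapping each word to the ids of the
    documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, terms):
    """Fast retrieval: intersect the posting sets of the query terms,
    never touching the documents themselves."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```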
There are, however, disadvantages. The danger of doing work in advance is that the world may change and the work done in advance may be invalidated. The best known example of this is the dangling hypertext link: the author has retrieved a document in advance and provided a link to where the document was found, but when the user selects the link the document is no longer there. Doing all the stages on-the-fly greatly reduces this danger.
As our introductory examples indicated, the process of obtaining information has many dimensions, as represented by the following parameters:
Each dimension represents a spectrum of possibilities (though it is a narrow spectrum in the case of delivery if there are just the two possibilities: proactive and interactive). A common factor among all of these dimensions is the choice between being free and being constrained: whether the user has complete control and responsibility, or whether the application takes some of this control and responsibility to reduce the burden on the user.
In subsequent Sections we discuss how existing tools cover parts of this multi-dimensional spectrum, how these tools could be generalised to cover more of the spectrum, and how new tools might be designed to cover areas of the spectrum that are not well covered at present. Before discussing details, however, we will describe a general consideration: the use of context.
In all the application areas that we have described there is a continuing quest to make the delivered documents more relevant to the user. The general approach to tailoring information to a user is to collect information about that user or about other similar users who have the same needs. We call this the user's context. Context can potentially be wide-ranging (15). Sensors can detect the user's physical context: where they are, what direction they are going, what companions/equipment are nearby. A further component of context, the user's computing context, is easy to capture: what document they are reading, what other applications they are running, etc. (In terms of context capture, the document being read is of interest not only for its content, but also for any metadata or semantic information associated with it.) In many situations the document being read is the most important aspect of the user's context, since it shows the user's current focus. Wider aspects of context can be extracted from the web: the current weather and the weather forecast, share prices, traffic information, etc. All of these can have a bearing on what retrieved documents are currently relevant to the user. The user's past behaviour can be analysed, perhaps with some feedback from the user on what retrieved information was most relevant, and this analysis can be extended to cover other, similar, users whose past behaviour may be a guide to the current user (as in peer-to-peer search engines). More generally the context can encompass user models and task models.
Overall the context consists of many different items -- we call them fields; at any one time some fields may be irrelevant to the retrieval process whereas others may be highly relevant. For example a field representing the user's location may currently be important, whereas share prices may not. Thus we have a concept of a weight attached to each field, and these weights can change dynamically according to the user's needs. Weights are typically set by automatic tools, since it would be a burden on the user to continually set and update them.
The main advantage of context is that, if it is collected automatically and then used effectively, it can bring big improvements to the relevance of the documents delivered, all without any extra user effort. There can, however, be further gains if the user does make an extra effort, by telling the application something about their nature and needs, e.g. whether they are a beginner or expert, what topics they are currently interested in, and where they are travelling to. This can be done as an occasional activity, in advance of the retrieval requests.
For all types of context it is useful if the application maintains a history, e.g. a trail of locations visited, or documents previously viewed. This `history' may encompass the future too, as derived, for example, from future diary entries. An example is that the user's diary might say they should be at a certain location in two hours' time; this element of context might have an effect on the user's current information needs. Context can be further enriched by guessing higher level contextual states from the values of low-level sensors. For example Pepys (16) used active badge information to detect whether a meeting was taking place (several people converging to the same place at roughly the same time, and then staying there). Being in a meeting is an important factor in a user's context, and should affect which documents are delivered to them, and, indeed, the user interface for delivery and selection.
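A Pepys-style inference of a higher-level state from low-level sensor values might be sketched as follows; the sighting format, time window and quorum are invented for the example and are much cruder than Pepys' actual analysis.

```python
def in_meeting(badge_sightings, now, window=10, quorum=3):
    """Guess that a meeting is taking place in a room if at least
    `quorum` people have been sighted there within the last `window`
    minutes.  `badge_sightings` is a list of
    (person, room, minutes_timestamp) tuples from active badges."""
    recent = [(p, r) for p, r, t in badge_sightings if now - t <= window]
    rooms = {}
    for person, room in recent:
        rooms.setdefault(room, set()).add(person)
    return {room for room, people in rooms.items() if len(people) >= quorum}
```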
Overall, therefore, the context can represent a rich resource for helping to find what information the user is interested in. In our introductory set of examples, the one labelled `context-aware retrieval' showed the manifest exploitation of this resource, but context could be valuable in many of the other examples too. Exploiting physical context is probably easiest, e.g. the location of a mobile user. Matching of locations is easy (though not trivial, for instance when the user is constrained by streets) and reliable in the sense that all the documents whose associated locations are close should be delivered, and all those that are far away should not. At the other end of the scale, looking at the contextual history of a user's past retrievals, and making this influence the current retrieval operation is hard and potentially unreliable -- occasionally it will lead to irrelevant documents being delivered or relevant documents missed. However most web browsers provide a small step in this direction: they keep a history of links followed, and, each time a link is displayed, display it in a different colour if the destination of the link has already been visited. The web browser does not make any judgement on the relevance of this information: it just tells the user and lets them judge. It would also be relatively easy for a web browser to test if the information at the link destination has changed since the user last viewed it, and alert the user if this is so.
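The last suggestion -- testing whether the information at a link destination has changed since the user's last visit -- could be implemented by storing a digest of the page content at visit time and comparing it on re-display. The functions below are an illustrative sketch, not a description of any actual browser; the choice of SHA-256 is an assumption.

```python
import hashlib

def fingerprint(content):
    """Digest of a page's content, to be stored when the user
    follows the link."""
    return hashlib.sha256(content.encode()).hexdigest()

def has_changed(stored_fingerprint, current_content):
    """True if the destination's content differs from what the user
    saw last time -- the browser could then alert the user when it
    next displays the link."""
    return fingerprint(current_content) != stored_fingerprint
```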
Frequently it is the current context, rather than the historical context, that is most relevant, and this context may be changing quickly. Often the user's current context cannot be known accurately in advance -- and certainly not well in advance: in these cases context is best used in retrieval processes that are on-the-fly.
There is, however, one case that is exceptional: as we have said, in a hypertext system, the author, when embedding links in a page, knows that the user will be reading that page when they select the links -- thus the author knows in advance the document (i.e. web page) being read. This is, of course, part of the user's context. If the page is part of a hierarchy or a sequence of pages, the author may further deduce how the user reached the current page, and what documents they saw on the way; in addition the overall nature of the web presentation to which the page belongs may be a guide to the nature of the user (e.g. the presentation may be `Mathematics for the conceptually challenged'). Such contextual information is routinely used by hypertext authors (`if the user is reading this page, they are assumed already to know about X'). This case is exceptional for a second reason: the same context applies to all users, whereas generally a contextual field is tied to an individual user.
Having discussed the issue of context, we will now look more closely at the components of the three-stage retrieval process. This discussion occupies a substantial part of the rest of the paper.
A fundamental issue is the structure within the documents to be retrieved. In information retrieval the data is typically unstructured or semi-structured. If information is fully structured, in the sense that each document is divided into exactly the same fields, then database technology is typically used. Although not totally structured, IR document collections are, however, often divided into fields, and the query may refer to individual fields, e.g. `Match XXX in the Author field and YYY in the Title field', or, with a context-aware application, `Match XXX in the Location field and YYY in the Time Field'. As we observed when discussing context, different fields can have different weights, e.g. that Location is twice as important as Time. Moreover these weights may change dynamically, e.g. in a context-aware application for tourists, the field whose value has changed most since the last query may get the highest weight.
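A weighted fielded match of the kind just described can be sketched as follows. The field names, the weights and the simple equality test are illustrative assumptions; a real engine would use a per-field similarity measure (e.g. geographical distance for a Location field).

```python
def field_score(query, document, weights):
    """Weighted fielded match: each query field contributes its weight
    when the document's value for that field matches, and the total is
    normalised by the sum of the weights, giving a score in 0..1."""
    total = sum(weights.values())
    matched = sum(weights[f] for f, v in query.items()
                  if document.get(f) == v)
    return matched / total if total else 0.0
```

With Location weighted at twice Time, a document matching only the Location field scores 2/3, reflecting the relative importance of the two fields.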
The design and issuing of a query (together with other aspects such as the specification of weightings and of the document collection to be used) can be done by the user, or can be done wholly or partly by an assistant. The assistant can be a human or a program. In the human case, the assistant's work can be done in advance, as with an expert author who has designed how readers will retrieve information, or can be a person who interacts directly with the user -- indeed they could be sitting side by side. The assistant may design the query and/or may decide when to issue the query.
The user may want the assistant to specify the query because:
Often the specification of the query is done jointly by the user and the assistant: the user first specifies the query, but, behind the scenes, the assistant enhances it to factor in additional considerations. Actually, if we think in implementation terms, we may be over-simplifying how the assistant works: it may be more convenient to implement assistants as components in a pipeline of retrieval operations, rather than as contributing to a single monolithic query.
The document collection may be fixed by an application. Alternatively it can be specified by the user or by an assistant, e.g. by an intelligent resource-discovery agent that finds the most appropriate document collection on-the-fly. An example of such an assistant would be an agent that found a document collection that gave tourist information about locations that the user was close to or was heading towards.
Even if the nature of the document collection is known in advance, its content might not be; this would apply, for example, to a document collection about current traffic problems. Most IR applications, however, depend on knowing the content in advance.
After the query has been specified, it is subsequently issued to the retrieval engine. In any push technology, it is not the user who issues the query; instead the retrieval engine itself does it. Push technology has, of course, existed for a long time, well before electronic pushing became viable: for example a librarian might perform Selective Dissemination of Information (SDI) by sending notes to users when new information becomes available. Lessons learned from this, and indeed from the work of librarians in general, carry over to the present day and may help prevent re-inventing of wheels that are less round than their predecessors (17).
In some applications that use push technology, such as Information Filtering (IF), the query is still designed by the user. Here the reason that control at Stage 2 is taken from the user is that they would not know when to issue the query (e.g. in the IF case they do not generally know when each new document arrives), and, even if they did, it would be tedious to continually issue the same query.
Proactive context-aware retrieval (CAR) systems are similar to IF systems, but the queries are associated with each document; they are prepared in advance by an author, not by the user. For example a document associated with a garden might have the attached query `is the user's location near the garden, and does the time correspond to the garden's opening hours?'. The query attached to a document can be regarded as a form of metadata; indeed the query need not be explicit, but could be automatically derived from metadata attached to the document. For example the document associated with the garden might have some `requirements' metadata that has two fields: a location and a time. In addition the nature of the dynamic elements in CAR differs from that in IF. In proactive CAR the document collection may well be completely static (whereas it is dynamic in IF): the dynamic element is the user's current context, against which the queries attached to each document are matched. Typically CAR is automatically performed whenever there has been some significant change in the user's context (the context typically includes time, so one criterion for a new retrieval can be that time has advanced by a certain amount).
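The garden example can be sketched directly: each document carries its own `requirements' metadata (a location and opening hours), and retrieval matches the current context against every document's requirements. The representation of locations, the closeness test and the field names are all assumptions made for the sketch.

```python
def near(loc_a, loc_b, radius=0.5):
    """Crude closeness test on (x, y) coordinates in kilometres."""
    dist = ((loc_a[0] - loc_b[0]) ** 2 + (loc_a[1] - loc_b[1]) ** 2) ** 0.5
    return dist <= radius

def car_retrieve(documents, context):
    """Proactive CAR: match the user's current context against the
    query implicit in each document's requirements metadata; no
    user-issued query is involved."""
    hits = []
    for doc in documents:
        req = doc["requirements"]
        if (near(req["location"], context["location"])
                and req["opens"] <= context["hour"] < req["closes"]):
            hits.append(doc["title"])
    return hits
```

In a running system `car_retrieve` would be re-invoked whenever the context changed significantly, e.g. when the user moved or time advanced.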
There are CAR systems that are interactive rather than proactive, but their operation is basically similar. Overall the important point about all these cases is that change (a new document, a change in the user's context) causes retrieval to occur, and the assistant may know much more about change than the user, and may thus be able to facilitate the retrieval of documents that are relevant to the changed circumstances.
Table I shows the properties of some existing applications. Clearly, since systems vary, the table can only give an overall impression rather than a definitive statement for every existing system. Within the table we use the suffix `/Adv' to mean `is (or may be) done in advance', and `User' means the end-user. The table row labelled `Generic link' describes the facility first offered by Microcosm (18); a generic link creates at run-time a link from a word or phrase to places where more information about the word or phrase can be found.
Application       | Query      | Document collection        | Push/Pull
------------------|------------|----------------------------|--------------
IR                | User       | User or (Application/Adv)  | Pull
IF                | User/Adv   | (User or Application)/Adv  | Push
CAR               | Assistant  | User or (Application/Adv)  | Push or Pull
WWW link          | Human/Adv  | Human/Adv                  | Normally Pull
Generic link      | User       | Human/Adv                  | Pull
Autonomous agents | User/Adv   | Assistant                  | Push or Pull
The `Document collection' column refers to the nature of the document collection (e.g. Traffic Reports for London) rather than its content. However, for IR applications that deal with huge numbers of documents, the content needs to be known and pre-processed in advance. Thus although the user may be able to choose the document collection, the choice will be confined to those whose content has been suitably pre-processed.
Researchers are always trying to break the mould, and the typical properties embodied in Table I may well be superseded in future. This applies in particular to the hypertext model of fixed links, crafted by an author in advance, embedded in a hypertext document. Instead links can be stored in a linkbase, separate from the document(s) they apply to; the user might be able to choose between different linkbases, and links might be created dynamically and/or adapted according to the user's profile. An interesting example of how far this process can go is provided by hypertext-augmented reality (19). Here the user can cause a 3D object to appear in their augmented reality -- the example quoted is an image of an aeroplane that appears to the user to be the size of a model aeroplane -- and can cause links to be superimposed on this object. For example a label might be superimposed on the engine of the aeroplane, where this label represents a hypertext link to a description of the engine (either generic to aircraft engines or particular to the engine of that individual aircraft -- in general it is a challenging retrieval problem to know whether the generic or the particular is more relevant to the user). The user specifies the types of link that they want. They do this with simulated salt and pepper pots that they can pick up and use to sprinkle links onto the object. For instance the salt could represent technical information, and the more salt that was sprinkled onto the aeroplane, the more technical information would appear. Pepper and other condiments can support other sorts of link, and perhaps different document collections to provide information.
Overall the effect is to move hypertext away from its static, stage 3 only, slot. Instead there is dynamic selection of the document collection, and, more importantly, the link structures to be used, albeit within the constraints of what authors have provided.
There is usually a degree of uncertainty on whether the delivered documents will really be of interest to the human end-user -- if there is no uncertainty the selection stage is irrelevant as the retrieved documents can be delivered direct to the user. The selection stage should give the user as much help as possible in resolving this uncertainty. There are two standard ways of doing this, and both can be used together: presenting the delivered documents as a ranked list, ordered by likely relevance, and attaching to each delivered document a label that describes or explains it.
Overall the ranked list and the labels provide an opportunity for the application to explain to the user why each delivered document may be relevant to them. Obviously this is easiest in a hypertext system, where the documents are known in advance and a human author writes the labels and provides any ranking (`Here are six papers, in order of increasing complexity, that explain more'). However there are also opportunities for automatic systems to provide further information -- generated on-the-fly -- to users (`this garden is very close to your current location, is open, and matches your interest in conifers; you have not apparently visited any gardens on your current trip'); such opportunities are not widely exploited at present, but, we believe, could be an important part of the success of an overall system.
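An on-the-fly explanation of the kind quoted above could be generated from the contextual fields used in the match. The following sketch assumes invented field names (`distance_km', `open', `topics', `interests'); the point is only that the reasons for delivery are articulated rather than left implicit.

```python
def explain(document, context):
    """Generate, on-the-fly, a label telling the user why this
    delivered document may be relevant, built from whichever
    contextual fields contributed to the match."""
    reasons = []
    if document.get("distance_km", 99) < 1:
        reasons.append("very close to your current location")
    if document.get("open"):
        reasons.append("open now")
    shared = set(document.get("topics", [])) & set(context.get("interests", []))
    if shared:
        reasons.append("matches your interest in " + ", ".join(sorted(shared)))
    return "; ".join(reasons) if reasons else "no obvious match"
```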
Some hypertext systems support one-to-many or even many-to-many links, rather than the traditional one-to-one link. In terms of the model presented here these are not fundamentally different from one-to-one links: they just offer a richer user interface for selection.
To recap the two extremes presented at the start of this paper, an IR system offers the ultimate in freedom: the user has control over the complete retrieval process (except perhaps for the document collection to be used). A web page, on the other hand, is in retrieval terms a highly constrained system: the author has done all the retrieval work in advance, and the only choice the user has is to select one of the links provided. If a constrained system suits the user's needs, this is ideal: the user has been saved a lot of work by the author who successfully constrained retrieval to cover just what the user needed.
As we explained earlier, the premise of this paper is that unfortunately no application will meet the needs of all users in all situations. In particular sometimes the user will want less constraint, and sometimes the user will want more help from the application, often in the form of constraining a large number of possibilities into a smaller number, better tailored to the user. Thus many applications provide extra mechanisms, either to remove constraints or to impose them, thus improving the application's versatility.
One approach to removing constraint is as follows: the user is viewing a web page, and wants other information, which is not covered by the links provided. The user then selects, from within the web page, one or more words (or passages) that are of especial interest, and hits a button called `Retrieve' or the like. Documents relevant to the selected words (and to the document the user is currently reading plus other context) are then retrieved and delivered. In some applications the document collection from which retrieval takes place may be different from the original one, i.e. it is a subsidiary document collection. The subsidiary document collection is typically a limited one, especially tailored to the material in the web presentation currently being accessed. For example the subsidiary document collection may take the form of a dictionary: if the user selects a word in the current web page, and if that word is in the dictionary, the dictionary entry is displayed. The dictionary need not be a comprehensive one: it could just be a glossary of special terms used in the web presentation, or of topics for which there is further information (like generic links in Microcosm). Alternatively instead of offering a dictionary -- a highly constrained and focussed artifact -- the application might offer an opposite extreme and search the whole web for the words the user has highlighted.
The above process, based on finding documents related (a) to the document the user is currently reading and (b) to an individual user's context, can be automated. There are many systems, some of them commercial products, that do this. One class of these is the Just-in-time Information Retrieval agents of Rhodes and Maes (22). These proactively create a retrieval query based on what the user is currently reading and/or writing, and on the user's context (which includes past history). This query leads to the retrieval of some documents, hopefully highly relevant to the user's current activity, and these documents are presented discreetly (and discretely!) to the user so that their sudden arrival does not unduly interrupt the user's current task. Xlibris (23) also lies in this class: Xlibris has a pen-based interface, and one of its capabilities is to perform a retrieval search based on "ink" marks drawn with the pen by the reader in order to annotate the document they are reading. The aim, which is similar to the hybrid applications we discussed earlier, is to add a broader view to a constrained process: ideally this can lead to chance discovery of some relevant documents outside the user's normal fields of perusal.
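The automatic query construction performed by such agents can be sketched very roughly as follows. This is not Rhodes and Maes's algorithm, merely a toy illustration of the principle: terms from the document currently being read are weighted, and terms that also appear in the user's context (here reduced to a set of history terms) receive an assumed boost.

```python
from collections import Counter

def build_query(current_text, history_terms, k=3):
    """Proactively derive a query from what the user is reading,
    boosting terms that also occur in the user's past history.
    (Illustrative only; real agents use far richer context.)"""
    counts = Counter(w.lower() for w in current_text.split())
    for term in history_terms:
        if term in counts:
            counts[term] += 2   # assumed boost for contextual terms
    return [t for t, _ in counts.most_common(k)]

q = build_query("retrieval agents rank retrieval results", {"agents"})
```

The retrieved documents would then be presented unobtrusively, as described above; the sketch covers only the query-building step.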
The spectrum between freedom and constraint also has an impact on how relevance feedback is implemented: it is relatively easy to provide user facilities of the form `this is too general; I only want the sort of material that the following documents cover'; it is in general harder to allow the user to ask for a release of constraints, not least because (a) they may not understand what the current constraints are, and (b) they may not know what the world is like outside these constraints.
We now briefly look at freedom and constraint from a more theoretical standpoint. Direct hypertext links as found in WWW represent only one of a large number of possible types of link. Of course, WWW offers other types of link, such as links to CGI scripts, but these still represent a subset of the possibilities. A much wider classification of links is provided by DeRose (24) (also see (25) for a more formal analysis of links). In DeRose's classification there are two overall types of link: (a) an extensional link, where the link is essentially an ad hoc connection to one or more possible documents, and (b) an intensional link, where the link is derived from executing a function. One example of such a function is a CGI-script, which in effect creates a new document on-the-fly and links to it. Another is what DeRose calls a retrieval link. A retrieval link creates, on-the-fly, a link to some existing document (which may come from some restricted collection of documents or the whole `docuverse'). DeRose's classification represents a theoretical description, rather than a taxonomy of existing systems. In principle, however, the function that drives an intensional link can do anything, and can cover all the possible functionality we have described here. Hence it could provide any degree of freedom up to full IR.
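DeRose's two overall types of link can be captured in a few lines of code. In this hypothetical sketch an extensional link simply stores its authored target(s), whereas an intensional link executes a function over the current context to compute them; a retrieval link is then just one kind of intensional link.

```python
# DeRose's distinction, sketched: an extensional link stores its
# targets; an intensional link derives them by executing a function.

def extensional(targets):
    return lambda _context: targets          # fixed, authored in advance

def intensional(fn):
    return lambda context: fn(context)       # computed on-the-fly

# A toy "docuverse" and a retrieval link over it (names illustrative).
docs = {"q1": ["doc-a"], "q2": ["doc-b", "doc-c"]}
fixed_link = extensional(["doc-a"])
retrieval_link = intensional(lambda ctx: docs.get(ctx["query"], []))
```

Since the function behind an intensional link is arbitrary, it can in principle reproduce anything from a fixed link up to full IR, which is exactly the theoretical point made above.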
Up to now we have spoken in terms of a single context for the user, but this is simplistic. Most of us wear several hats; we may be a researcher in X and Y, a teacher in X and Z, an administrator, a hobbyist, a traveller, ... . Thus we have several possible contexts, and continually switch between them. Contexts typically consist of an aggregate of several components, which we have called `fields' -- to parallel the fields within documents. Different contexts may share some contextual fields. For example the current time will be the same for nearly all of them: however an exception to this would occur if the traveller set their time to a pretended value, representing a future time at which they plan to travel. To meet the need for multiplicity, an application can maintain not one context, but several.
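The idea of contexts as aggregates of fields, some of them shared, can be made concrete with a small sketch (the field names are invented for illustration):

```python
# Contexts as aggregates of named fields; different contexts (hats)
# share some fields, such as the current time. Names are illustrative.

shared = {"time": "09:00"}   # common to nearly all contexts

researcher = {**shared, "role": "researcher", "topics": ["X", "Y"]}
teacher    = {**shared, "role": "teacher",    "topics": ["X", "Z"]}

# Switching hats is simply selecting which context is current.
current = teacher
```

An application maintaining several such contexts can then switch the current one as the user changes activity, as the applications below do.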
In this final part of the paper, we postulate one piece of novel infrastructure plus some applications that cover wide parts of one or more dimensions of the spectrum that we have explored above. Our postulated applications represent combinations of existing applications, e.g. C + D + a dash of E and F. In such cases we will somewhat arbitrarily take one component as a starting point, C say, and build from there. A key to these applications is a piece of infrastructure that we will now describe; its purpose is to give the user greater freedom to manipulate documents they are reading, such as web pages, and, as a beneficial side-effect, thereby to allow a computer program, by looking at the user's manipulations, to understand better what the user is interested in. This is the read/write interface.
We have said that an important part of the context is the document the user is currently reading or writing. For a read/write interface we assume that the tools for writing, such as text editors, also provide reading: obviously the user can read what they have written, but in addition almost any editor allows the user to import other documents -- thus the editor can be used as a crude file-browsing system. More unusually we assume all reading tools, such as web browsers, allow annotating (i.e. writing) too; for example the user can change words in the current document they are reading, add new words, delete words, etc. (These annotations may be ephemeral -- being lost when the user moves to a new document -- or they may be preserved in some way. We are not assuming a read/write interface offers all the power -- and complexity -- of a full authoring system.) Overall we have a concept of the current document of interest, which the user may read and write. One extreme is where the document being read is null, and thus the user is writing a new document; the other extreme is where a passive reader reads an existing document without annotating it. The most interesting cases come in between. In addition we assume that the user has a facility for feeding back their level of interest in each component of the current document: at its simplest this can be a facility for the user to highlight the words or sections that most interest them -- here we may have a simple Boolean division, where highlighting means "very interested" and lack of highlighting means "not very interested". Further we assume that the current document can contain hypertext links. Finally we assume that details of the current document, such as what parts the user wrote and what parts they have highlighted, are available to outside applications. 
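To make the read/write interface concrete, here is a minimal sketch of a current document of interest. The class and method names are hypothetical; the essential points are that reading and writing go through the same object, that highlighting is a simple Boolean mark of interest, and that the marks are exposed to outside applications.

```python
# Sketch of a read/write document: the user may annotate (write into)
# and highlight what they are reading, and outside applications can
# query those marks. All names are hypothetical.

class ReadWriteDoc:
    def __init__(self, text=""):
        self.words = text.split()        # empty text: writing a new document
        self.highlighted = set()         # word indices the user has marked

    def annotate(self, index, word):     # writing into what is being read
        self.words.insert(index, word)

    def highlight(self, index):          # Boolean "very interested" mark
        self.highlighted.add(index)

    def interests(self):                 # exposed to outside applications
        return [self.words[i] for i in sorted(self.highlighted)]

doc = ReadWriteDoc("context aware retrieval")
doc.annotate(0, "mobile")                # the reader adds a word
doc.highlight(3)                         # ... and marks "retrieval"
```

The two extremes described above fall out naturally: a null initial text means the user is purely writing, while a document that is never annotated or highlighted is being passively read.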
Overall we feel that annotation is a natural aid that any user can employ to help their reading of the current document; if this annotation is known to other tools, which can as a result retrieve documents that are more relevant to the user, it is an almost free bonus. We can extend the concept of the document of interest to allow several simultaneous current documents of interest if we need to.
We believe that the read/write interface is a simple and powerful aid to allow the user to influence the retrieval process. We now postulate the applications. They exploit a read/write interface, and they also cater for multiple contexts, as introduced in the previous Section.
Our first example starts from IF (Information Filtering), and our postulated application is called SUPERIF. As with normal IF the query is supplied in advance. Instead of working from a pre-defined document collection, SUPERIF uses a resource-discovery agent to find documents that meet the user's needs, as given by their current query. When a user first employs SUPERIF, they may optionally choose to set an initial query; in any case whatever query the user supplies is automatically supplemented by SUPERIF. This supplementing is done by a process of deduction from looking at each individual user's retrieval behaviour, and evolving their query continually (e.g. daily or weekly). Even in our make-believe world, however, it would be unrealistic to expect this process of query deduction and evolution to work well all the time. Thus the user will sometimes want to intervene, and to modify the query that has been automatically constructed for them; hence the application must be able to present the query to the user in a comprehensible and easily changeable form.
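Since SUPERIF is hypothetical, any sketch of its query evolution is necessarily an invented one; the update rule below (terms from documents the user actually opened gain weight, old interests decay) is an assumption chosen purely to illustrate the two requirements above: continual evolution, and a query that remains presentable to the user in a comprehensible form.

```python
# Sketch of SUPERIF-style query evolution. The decay/gain rule is an
# assumption; the point is that the evolved query stays inspectable.

def evolve(query, opened_docs, gain=1.0, decay=0.9):
    """Fade old term weights, then reward terms from documents the
    user chose to open."""
    query = {t: w * decay for t, w in query.items()}
    for doc in opened_docs:
        for term in doc.split():
            query[term] = query.get(term, 0.0) + gain
    return query

def present(query, k=2):
    """The comprehensible, user-changeable form: top-weighted terms."""
    return sorted(query, key=query.get, reverse=True)[:k]

q = evolve({"hypertext": 2.0}, ["retrieval agents", "retrieval context"])
```

Presenting the query as an editable list of weighted terms is one plausible answer to the requirement that the user be able to intervene and correct the automatic process.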
In line with our earlier comments about multiple contexts, SUPERIF creates several queries for each user, one for each hat they wear. SUPERIF has proactive delivery of information, but is radical in the way it delivers retrieved documents. It only delivers documents when the user is detected as being in a context where `they have the right hat on'. Thus if a query related to research papers, the documents it retrieved would be delivered when the user next read or wrote a document relating to that activity. (There might be some special process for urgent documents: e.g. delivery with any hat on.) SUPERIF presents the set of delivered documents in a read/write interface, so that the user can make annotations or changes to aid selection or subsequent retrieval.
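The hat-gated delivery rule is simple enough to state in code. In this sketch (all field names invented) retrieved documents wait in a pending pool until the user's current hat matches the hat of the query that produced them, with urgent documents bypassing the gate:

```python
# Sketch of hat-gated delivery: a pending document is released only
# when the user has the right hat on, unless it is marked urgent.
# All names are illustrative.

def deliverable(pending, current_hat):
    return [d for d in pending
            if d["hat"] == current_hat or d.get("urgent")]

pending = [{"doc": "new paper on X", "hat": "researcher"},
           {"doc": "room change",    "hat": "teacher", "urgent": True},
           {"doc": "agenda",         "hat": "administrator"}]
out = deliverable(pending, "researcher")
```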
SUPERIF should cater for both short-term and long-term retrieval needs. Movement on the spectrum between short and long term can be achieved by adjusting weightings in the use of context. Thus for short-term needs the current context has a higher weighting than history. For long-term needs the current context is less important, and history more important.
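This movement along the short-term/long-term spectrum amounts to a single weighting knob. As a hedged illustration (the blending rule and parameter are assumptions, not a claim about how SUPERIF must work), a document's overall score could blend its relevance to the current context against its relevance to the user's history:

```python
# Short-term versus long-term needs as one weighting parameter:
# alpha near 1.0 means the current context dominates (short-term);
# alpha near 0.0 means history dominates (long-term). Illustrative.

def blend(current_score, history_score, alpha):
    return alpha * current_score + (1 - alpha) * history_score

short_term = blend(0.9, 0.2, alpha=0.8)   # current context dominates
long_term  = blend(0.9, 0.2, alpha=0.2)   # history dominates
```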
Our second application, SUPERIR, starts from IR as a base. SUPERIR only does context-aware retrieval. Since part of the context is the current document of interest, the user can simulate a current search engine by creating a new null document and just typing some search terms into it. (At an extreme they could ask for all other aspects of their context to be shut out, thus relying solely on the search terms they have just typed.) SUPERIR can be used as a web browser, and indeed a typical pattern of usage may be as follows: the user loads a document, and, if this is a hypertext document, perhaps follows a series of one or more links to further documents: when the user feels they want a broader perspective, they mark the parts of the current document that are especially relevant to their needs, and also perhaps add some annotations: they then hit a "Retrieve" button to cause SUPERIR to find some more relevant documents. SUPERIR, like SUPERIF, caters for multiple contexts. SUPERIR is geared to the user's short-term needs, and to cases where the query is not supplied in advance but interactively on-the-fly.
In designing our two hypothetical applications, we could have used various criteria for distinguishing them, e.g.: (a) one for short-term needs and one for long-term, or (b) one for proactive delivery and one for interactive. We do not, however, believe that either of these is fundamental. Instead the criterion we have used is whether queries are specified in advance, or whether they are supplied interactively on-the-fly. In a sense this criterion is not fundamental either; however we believe it is fundamental to having any chance of an efficient and hence usable application that caters for a large number of documents. Thus we believe that an implementation-based criterion must, at least in the foreseeable future, take precedence over others. The criterion is part of the difference between IR and IF, and the whole key to each of these has been to provide optimisations based on the parts (the queries or the document collection) that are known in advance. To be practical we think therefore that SUPERIR, which has dynamic queries, would have to have some advance knowledge of the document collections that might be used, and their content. Obviously, however, there will be specialised applications outside the scope of SUPERIR -- e.g. a traffic information system in which all documents are indexed by location and content is changing continually -- where high performance needs to be achieved in a totally dynamic world.
Obviously many variants of SUPERIF and SUPERIR are possible, but we believe that more important than the tools themselves is the underlying read/write interface. Perhaps the key to a whole range of advances is to escape from the legacy that reading documents and writing documents are separate activities.
At the start of this paper we quoted the view -- a view widely held -- that, in many real situations, following hyperlinks is too restrictive whereas using a general IR search is too permissive. The user wants something in between the extremes, but there is an added, sometimes implicit, requirement that this something must require no more user effort than the extremes, and ideally should require less effort. A key to achieving this is automatic processes that select or enhance the query, choose the document collection, and perhaps proactively deliver documents. The most natural way to accomplish this is to collect and exploit information about the user's context, and to use this in the automatic process. A further aid, affecting both convenience and performance, is to perform some stages of the retrieval process in advance.
Automatic processes, intended to work on the user's behalf, are, however, a two-edged sword. If the automatic processes deviate from the user's real needs, then, since the average user is unaware even of the existence of the automatic processes, they will have great trouble in correcting the problem. More generally, we have all had problems trying to tame too-clever-by-half software. A key to improving this is for the software to supply the why as well as the what. In terms of our retrieval model this is especially relevant in the construction of queries (see our above suggestion that SUPERIF explained why its queries had been constructed) and at stage 3: the stage where the user selects from the documents that have been retrieved. As well as presenting what has been retrieved, an application that uses a lot of automatic processes needs to explain, as an option, why each document has been retrieved, e.g. `this document describes the XXX museum; it relates to your [deduced] interest in YYY, and represents a suggested afternoon activity given the forecast for rain'.
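Supplying the why alongside the what requires only that each retrieved document carry a rationale built from the deductions that produced it. A minimal sketch, with invented field names, reusing the museum example above:

```python
# Sketch: a retrieved document carries an optional rationale -- the
# "why" alongside the "what". Fields and wording are illustrative.

def explain(doc, matched_interest, context_reason):
    return (doc + ": relates to your [deduced] interest in "
            + matched_interest + "; " + context_reason)

msg = explain("this document describes the XXX museum", "YYY",
              "a suggested afternoon activity given the forecast for rain")
```

The essential design point is that the rationale is assembled at retrieval time, when the deductions are still available, rather than reconstructed afterwards.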
Currently we have a plethora of retrieval tools, each representing a point on a spectrum. The future surely lies in (1) making these tools work together in a seamless way, and (2) making each tool cover a wider part of the spectrum. A ubiquitous adoption of a read/write interface is, we believe, an aid to (1). As regards (2), we have proposed two tools, SUPERIR and SUPERIF, to this end; each knows some information in advance, and this offers a chance for a practical and efficient implementation that caters for document collections of a realistic size.
Finally, as well as looking at tools -- which has been a focus of this paper -- we need to look at fundamentals. We need to understand a human's searching strategies, for example by following and expanding Marchionini's models. Moreover we need to think about document models: reading and writing need to be treated as complementary, and our suggested read/write interface is aimed as a start to achieving this. Our current document models draw far too many unnecessary divisions: between different types of application, between reading and writing, between paper and electronic form. Radical new models, combined with good engineering of applications, are a key to a newer generation of software tools that are much less constrained than the current generation.
A lot of new insights relating to this paper were provided by Wendy Hall and Les Carr, and their colleagues at Southampton University. Douglas Tudhope helped hugely both with initial drafts and the final version, and gave pointers to relevant areas of research that were new to me. I am also grateful to two anonymous referees.