Remembering web pages visited

P.J. Brown Department of Computer Science, Harrison Building, Univ. of Exeter, Exeter EX4 4QF, UK
E-mail: P.J.Brown@ex.ac.uk

Abstract

Todo

Todo issues: why only web pages?; annotations found when page pops up (i.e. web page -> annotations automatically added); annotation involves running a program (like Zellweger's paths -- idea cited by Zellweger); can you tell if a page is dynamic, and, if so, remember it?; visual memory -- thumbnails; history is a log, because linking (and scrolling) is about all you can do with most browsers; dynamic aspects; ref to forget-me-not

Introduction

Users of the web often have a need to find pages that they have read in the past, and browsers offer many diverse mechanisms for doing this, e.g. history, bookmarks, annotations, paths, etc. In this paper we look at existing and potential mechanisms, and examine whether it is possible to combine them into a single general mechanism. Would such a general mechanism be a successful jack-of-all-trades or would it be a master of none? We also look at issues concerning remembering massive numbers of pages. Could a researcher, working on a 5-year project, mark all the web pages he reads and wants to remember (perhaps with an annotation attached), and then easily retrieve the ones he needs at any later date? Carrying this further, can we support "lifelong annotation" (Brown, 2004), where a user captures all the interesting pages he has read over his working life?

Some mechanisms for remembering pages are designed for collaborating groups; many annotation mechanisms have this property. In this paper, however, we are not concerned with the extra complexity that groupware brings, but instead focus on personal remembering, and the way this can be enhanced by public resources on the web; one such resource is the huge repository of remembered pages that a search engine supports.

Virtually every mechanism consists of three parts: recording the pages read, retrieving the pages that have been recorded, and maintaining a database of the recorded pages.

Some existing mechanisms

In Table 1 below we show some of the properties of existing mechanisms for remembering web pages. We have included two mechanisms whereby an author makes their remembered links available for the world to use. These two mechanisms are (a) paths (also called guided tours, or, in Vannevar Bush's original paper, trails), where an author prepares a suggested route (a path of individual pages) through some more complex material, and (b) what we call "X's favourite links". The latter are commonly found on the web, in various guises, and assume other users follow the author's taste. The end-user of both these mechanisms may not themselves have visited the remembered pages: the aim is that the end-user's remembered pages are augmented by another person's -- the person we call "the author".

The first of the above mechanisms, paths, merits more discussion. Zellweger (1988) introduces a generalised idea of a path: a path consists, as is normal, of a sequence of fragments of documents, but each point in the path has a script associated with it. This script can perform any action. One such action is making a voice annotation; this might be used by someone commenting on a draft paper. Another possible action is to edit the document in which the fragment occurs: Zellweger's example is the source code of a program, where the path identifies places where extra debugging code can be inserted, and the script for each point on the path inserts the debugging code. Our table entry for "Path" reflects this potential generality.

The table entries show typical properties: obviously there are lots of different browsers, and many of these do not follow the herd, and thus have atypical properties.

TABLE 1: properties of existing mechanisms

Mechanism            Anchor           Augmentation  Structuring               ID
Bookmark             Page/fragment    No            User-designed hierarchy   HTML <TITLE>
History              Page/fragment    No            Time-based structure      URL or HTML <TITLE>
Path                 Page/fragment    Yes           Author-defined structure  Author's title or description
Annotation           Fragment         Yes           Attached to a URL         Often none
Notes (Opera)        Normally a page  Yes           User-defined structure    Yes
X's favourite links  Page/fragment    No            Author-defined structure  Name of source anchor

In Table 1, the Anchor column describes the nature of the anchor attached to the remembered page; in practice the anchor is set by default to a point at the beginning of the page (the table entry "page" covers this case), but it could also be a fragment within the page. The Augmentation column records whether the remembered page has some augmentation from the original, such as an added annotation. The Structuring column gives a typical structure of the underlying database; this structure is normally used in retrieval. The ID column records how the remembered document is named as an aid to the user to retrieve it, e.g. the name that appears in a bookmark list.
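The four properties described above can be read as the fields of a single record type. The following is a minimal sketch of such a record, not a description of how any browser stores its data; the class name RememberedPage and all of its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RememberedPage:
    """One remembered page. All names here are hypothetical."""
    url: str                             # where the page was found
    fragment: Optional[str] = None       # Anchor: None means the start of the page
    augmentation: Optional[str] = None   # Augmentation: e.g. the text of an annotation
    structure_path: tuple = ()           # Structuring: e.g. a folder hierarchy
    label: Optional[str] = None          # ID: e.g. the HTML <TITLE>
    remembered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def anchor_kind(self) -> str:
        """Return 'fragment' if the anchor lies within the page, else 'page'."""
        return "fragment" if self.fragment else "page"

    def display_name(self) -> str:
        """Name shown to the user at retrieval; falls back to the URL."""
        return self.label or self.url


# A bookmark-like entry: whole page, no augmentation, user-designed hierarchy.
bookmark = RememberedPage(
    url="http://example.org/news.html",
    structure_path=("News", "Daily"),
    label="Example news",
)

# An annotation-like entry: a fragment with added text and often no ID.
annotation = RememberedPage(
    url="http://example.org/draft.html",
    fragment="section-2",
    augmentation="Check this claim against the 1988 paper.",
)
```

Under these assumptions, a bookmark and an annotation differ only in which fields are filled in, which is one way of picturing a single general mechanism.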

Issues of change

The time of retrieval may be some time after the time of remembering, and as time goes by there is an increasing likelihood that the original document may have changed or its URL (assuming this is used to retrieve it) may have been changed or deleted.

In some applications the user may want to remember the original page, even if there is subsequently a later version. This is obviously the case in applications where the user wants to keep a historical record, e.g. of press releases from a particular company. We call this case static remembering. In other applications, the user will always want to retrieve the most recent version of a page; for example, a bookmark may mark a site that gives the latest news stories, and the user may want to use this to keep up-to-date. We call this dynamic remembering. We do not think it is generally possible to deduce, when a user remembers a page, which of the above two cases applies. Therefore the user herself must say. A reasonable default is dynamic remembering.

Static remembering can be implemented simply by taking a copy of the page to be remembered (and doing housekeeping such as attaching the correct <BASE> so that relative links work in the copy). This works well if the user's interest is confined to the page itself; if the user's interest also includes links from the remembered page, e.g. the link `first press release' in a press releases page, then problems may occur. The URL on this link may lead to a page that gives today's first press release, rather than the first press release on the day the page was remembered. We see no general solution to this problem, though a partial solution is to save linked-to pages too, especially if they appear to be dynamic.
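The <BASE> housekeeping step might look like the following minimal sketch. It assumes the page has already been fetched as a string, and it uses simple pattern matching rather than a full HTML parser, so unusual markup (for example a <BASE> tag inside a comment) is not handled. The function name add_base is hypothetical.

```python
import re
from html import escape


def add_base(html_text: str, original_url: str) -> str:
    """Return a copy of html_text whose relative links resolve against original_url."""
    # If the page already declares a base, leave it alone.
    if re.search(r"<base\b", html_text, flags=re.IGNORECASE):
        return html_text
    base_tag = f'<base href="{escape(original_url, quote=True)}">'
    # Prefer to place the tag immediately after the opening <head> tag.
    head = re.search(r"<head\b[^>]*>", html_text, flags=re.IGNORECASE)
    if head:
        return html_text[:head.end()] + base_tag + html_text[head.end():]
    # No <head> found: put the tag first so it precedes every link.
    return base_tag + html_text


page = '<html><head><title>Press</title></head><body><a href="first.html">first press release</a></body></html>'
saved_copy = add_base(page, "http://example.org/press/index.html")
```

Note that this only makes the links in the copy point where they originally pointed; it does nothing about the problem described above, where the linked-to page has itself changed.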

Dynamic remembering can be implemented just by remembering the URL. Obviously this has the dangers mentioned above, but users have come to live with these dangers. One way of alleviating the dangers is to remember both the URL and its current content, so that if there are problems with the URL, the saved content can be used instead. Alternatively a lexical signature can be used in place of the current content, as in the method of Phelps and Wilensky (2000); this has the merit of being able to find a page if its URL has changed.
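To give a feel for the lexical-signature idea, here is a minimal sketch that picks the few words that are frequent in the page but rare elsewhere. It is not a reproduction of the Phelps and Wilensky method: the scoring formula, the function name lexical_signature, and the document-frequency table doc_freq (which in practice would have to come from some large collection of pages) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, n_docs, k=5):
    """Return up to k terms that are frequent in text but rare in the collection.

    doc_freq maps a term to the number of documents containing it (assumed
    to be supplied from elsewhere); n_docs is the size of that collection.
    """
    words = re.findall(r"[a-z]+", text.lower())
    tf = Counter(words)

    def score(term):
        # Term frequency times a smoothed inverse document frequency;
        # the +1 keeps terms unseen in the collection finite.
        return tf[term] * math.log(n_docs / (1 + doc_freq.get(term, 0)))

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]


# Illustrative, made-up document frequencies for a collection of 1000 pages.
doc_freq = {"the": 999, "browser": 100, "annotation": 10, "exeter": 1}
signature = lexical_signature(
    "The Exeter annotation: the annotation browser", doc_freq, n_docs=1000, k=2
)
```

The resulting terms would be stored alongside the URL; if the URL later fails, they could be submitted to a search engine as a query in the hope of relocating the page.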

Problems of change of content are worse when the remembering involves anchors within the content. A particularly bad case is when the remembering involves annotation of the original page. Various techniques (e.g. Bernheim Brush et al., 2001; Phelps & Wilensky, 2000a; Röscheisen et al., 1995) make anchors robust over change, but inevitably such techniques break down if change is radical enough.

To summarise, change is an inevitable problem with using any archived material. However, with remembered web pages there are several ways of alleviating the problems of change: change is usually a nuisance rather than a killer.

The database

ease of finding vs comprehensiveness (e.g. 1000 bookmarks); purging (automatic?); ordering (intelligent, e.g. order of favouritism -- but users do not like change);

Retrieval

The two issues for retrieval are, firstly, the internal method used to retrieve documents -- this becomes important if there are massive numbers of documents -- and, secondly, the related issue of the interface whereby the user specifies what she wants to retrieve. Currently each mechanism for remembering pages usually has its own method and user interface for retrieval, geared for the particular task it has been designed for. Thus a bookmark has a retrieval interface based on a pop-up menu (usually with a hierarchy of sub pop-ups); this is ideal for retrieving from a small set of possibilities, but would be hopeless if there were, say, 5000 bookmarks. A history mechanism typically has a retrieval structure based on time -- but perhaps also using URLs to identify pages.

For massive amounts of remembered pages, Information Retrieval (IR) techniques are needed for the internal retrieval mechanism. (Indeed, taking this to an extreme, a search engine can be regarded as a service that remembers all published pages for everyone, though many search engines have an emphasis on remembering current pages rather than archiving very old ones. To achieve their impressive speed, search engines use sophisticated IR techniques.)
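One of the simplest IR structures that could serve as such an internal mechanism is an inverted index, which maps each word to the set of remembered pages containing it. The sketch below is a toy illustration only: it does no ranking, stemming or stop-word removal, and the class name PageIndex and its methods are hypothetical.

```python
import re
from collections import defaultdict


class PageIndex:
    """A toy inverted index over remembered pages."""

    def __init__(self):
        # term -> set of identifiers of the pages that contain the term
        self._postings = defaultdict(set)

    def add(self, page_id, text):
        """Record every distinct word of text as occurring in page_id."""
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            self._postings[term].add(page_id)

    def search(self, query):
        """Return the ids of the pages that contain every word of the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        # Copy the first posting set so the index itself is never modified.
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = PageIndex()
index.add("p1", "Press release archive for 2003")
index.add("p2", "Latest press news")
matches = index.search("press archive")
```

Because a lookup touches only the posting sets of the query words, its cost depends on how common those words are rather than on the total number of remembered pages, which is what makes the approach attractive for 5000 bookmarks or a lifetime of annotations.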

References