Lifelong annotation: a working paper

P.J. Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
e-mail: P.J.Brown@exeter.ac.uk

ABSTRACT

A valued working practice for a researcher is to annotate each paper document they read. Several of these documents, with their annotations, may be kept over the researcher's working life. Some of the annotations may prove valuable, many years after they were written, when the researcher refers to the document again; other annotations, of course, will be wasted in the sense that they will be created but never again looked at. This note looks at corresponding annotations produced electronically rather than on paper. It also looks at the infrastructure needed to support a body of on-line annotations, and to extract the maximum value from them.

One-off vs. lifelong annotation

The nature of annotation tasks varies over a spectrum whose end points are:

  1. One-off annotations. A sample one-off task is annotating a draft document and returning the document to its author, so he can make corrections.
  2. Lifetime annotations. Here the user might be a researcher who wants to annotate every document she reads that may be relevant to her research. The ultimate potential use of the annotations may be well-defined, e.g. a thesis that is to be written, or unknown ("I may need this in the future"). In the latter case, in particular, annotations might be made over the lifetime of the researcher, and used -- or selectively culled -- at various times over a long period.

Obviously one-off annotations can be rather more sophisticated than the simple example quoted. They may involve two or more people collaborating in writing annotations, they may be used over a reasonable period of time, and they may involve a set of documents rather than just one. Nevertheless, even with all these extensions, the scenario is a long way from the all-encompassing needs of lifelong annotations.

Here we focus on lifelong annotations, and the extra infrastructure needed to support them. This infrastructure must help exploit the advantages of electronic annotations over paper ones. In particular it must cover the management and retrieval of annotations, and must be robust enough to deal with massive numbers of annotations, and with change over time. All these are needs over and above the needs of one-off annotations. We are talking throughout about an electronic world where documents are read and annotated on-line. We will, however, start by relating this to the paper world.

Use of paper

Increasingly researchers obtain their documents on-line. Nevertheless if they want to read a long document they will usually print it and read the paper version.

Researchers have always been accustomed to annotating the paper documents they read. Annotation is hugely easier on paper than on-line. This is partly because on-line annotation systems are currently crude and limited in scope; nevertheless even when the state-of-the-art improves, such systems are unlikely to approach the convenience of creating annotations on paper. Moreover annotating on-line normally implies reading on-line too, rather than printing out documents on paper. On-line annotations, however, have one big advantage over paper annotations: they can be searched electronically. This advantage may outweigh the disadvantage of the extra effort needed to create annotations on-line, and generally of reading a paper on-line rather than on paper. Moreover on-line annotations may in future support extended uses of annotation, combining the traditional concept of annotation with facilities found in hypertext systems and web browsers, such as:

More ambitiously, annotations might provide a novel way of allowing a document to come in different versions: each version is captured by a different set of annotations, and the user selects the set(s) they wish to see. This works best if annotations can be presented as embedded in the underlying document, perhaps indistinguishable from the original when the user reads the document.

Infrastructure needs

Data types

A prime need, if the infrastructure is to be usable for tens of thousands, or even hundreds of thousands, of annotations, is for the user to be able to attach properties to each annotation they make. A computer science concept for doing this (and other things too) is that of data type. Here we will assume every annotation has a data type. Each user will have their own supply of data types; sometimes these will be default system-created ones, but more often they will be created by the user themselves to reflect their own interests and needs. Examples are:

During their career, users will continually be creating new data types -- and refining/generalising their existing data types -- particularly when starting a new project or interest.

Data types in programming languages can either represent single scalar values, such as integers or strings, or can be structures built up of a number of other data types, e.g. a string together with an integer. In more sophisticated systems these structures can contain sub-structures. In Object-Oriented technology, concepts of inheritance between different data types can be applied. All these mechanisms are likely to be useful in data types for annotations.
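As a rough illustration of how structures and inheritance might carry over to annotation data types, the sketch below uses Python dataclasses. The type names Citation and Research-citation appear as examples elsewhere in this paper; the fields (cited_work, year, research_area) are hypothetical and chosen only to show a string combined with an integer.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """A structured type: a string together with an integer."""
    cited_work: str  # hypothetical field
    year: int        # hypothetical field


@dataclass
class ResearchCitation(Citation):
    """Inherits the fields of Citation and adds one of its own."""
    research_area: str = "unspecified"  # hypothetical field


c = ResearchCitation("A paper on context", 2004, "information retrieval")
print(isinstance(c, Citation))  # a ResearchCitation can be used wherever a Citation is expected
```

Under this arrangement a query for annotations of type Citation would also pick up the more specialised type, which is one way a user could refine a data type without orphaning older annotations.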

Assuming an annotation has a data type, this should capture the following three components:

The data type to capture an annotation might therefore be a structure with three fields, one field for each of the above. The content field might be of type Multimedia. Each field may itself be structured into sub-fields. If annotations are stored singly -- see later discussion -- there may be further fields to specify the ID of the underlying document and a timestamp for the annotation. On the other hand, if annotations on a document are grouped together, these fields may be attached to the grouping.
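A minimal sketch of such a structure follows. The field names are assumptions borrowed from terms used elsewhere in this paper (anchor, content, properties); plain text stands in for the Multimedia type, and the URL is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Annotation:
    anchor: str   # where the annotation attaches in the underlying document (assumed representation)
    content: str  # plain text standing in for the Multimedia type
    properties: List[str] = field(default_factory=lambda: ["general"])
    # The next two fields are only needed if annotations are stored singly.
    document_id: Optional[str] = None
    timestamp: Optional[datetime] = None


@dataclass
class AnnotationGroup:
    """If annotations are grouped, the ID and timestamp attach to the grouping."""
    document_id: str
    timestamp: datetime
    annotations: List[Annotation] = field(default_factory=list)


note = Annotation(anchor="paragraph 3", content="Compare with the paper world")
group = AnnotationGroup("http://example.org/paper.html",  # placeholder URL
                        datetime.now(timezone.utc), [note])
```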

Data types may also be attached to the sub-parts of an annotation, rather than to a whole annotation. Furthermore, data types can be regarded as a mechanism for supplying metadata. In the rest of this paper we will use the concept of metadata, since it is more general.

Relation to context-based retrieval

The main purpose of associating metadata with annotations is to aid subsequent retrieval. (This applies especially in personal annotation systems, since such metadata concepts as "supporting argument" and "counter-argument" among a group do not apply.) Thus concepts found in context-based information retrieval are highly relevant since the aim is the same: adding something to make retrieval easier. The context may be regarded as metadata that can be attached to an IR query and/or to all or part of the documents to be retrieved. Context may be physical, such as location, or may relate to the user's activity. In relation to the latter Toms talks of domain (health, travel, etc.), user's task, and genre. Järvelin and Ingwersen talk more generally of "dimensions" of context. Sometimes contextual elements may be derived automatically, e.g. physical location, and sometimes they may be supplied explicitly by the user. For further details see the proceedings of the SIGIR workshop from which several of the references below are taken.

All of the above is relevant to metadata for annotations. In our work so far we have not investigated the automatic supplying of context, apart from the timestamps we have attached to annotation files. Our thinking on the user's context has centred on projects or jobs (hats), rather than on short-term tasks or topic domain. A project may, of course, cross several domains. This approach is based on hunch: we have no evidence yet that it is better or worse than the others. In our implementation we have given users a great degree of freedom to create data types as they wish. Evidence from Kelly shows considerable variation in what users call a task or project: in her experiments one user said they divided their time (over a semester) between 6 tasks, whereas another user said they divided among 35.

Discipline

A mantra in computer science is that if an artefact is really large, the creators will need to follow some discipline in creating it. If there is no discipline, either complexity or the effects of a combinatorial explosion will make the artefact unusable and unmaintainable.

Generally humans hate discipline. For example software tools, e.g. certain CASE tools, that force people to work to a certain discipline tend to be unpopular with programmers and require coercion from management. With annotations the problem is sharper: if the future use (if any) of an annotation is unknown at the time of creation, the user will be reluctant to provide a lot of details about it.

Nevertheless a creator of a body of lifelong annotations will benefit from discipline. The discipline will make it easier to classify annotations and to retrieve the right ones when they are needed. This discipline may involve, at the time an annotation is created, recording a lot of properties to capture its nature. Many users, however, would not accept a strong discipline, and, assuming there is no coercion, would shun any annotation system that imposes it.

Thus designers of an annotation infrastructure need to tread carefully.

As regards data types -- which are one aspect of discipline -- the simplest approach is to have a single data type, Annotation, whose properties field consists of a single string. The user types the value of the string, which describes the nature of the annotation. If the user is reluctant even to do this, a default value "general" could be supplied.

A slightly more disciplined approach is for the user to choose among a set of pre-defined values for the properties field, e.g. For-thesis, For-ACM-paper, Citation, etc., as we used in a previous example. There should be a simple mechanism for the user to add new values to the list of pre-defined values.

A more general approach is to allow a single annotation to have more than one of these pre-defined property values, e.g. to be both For-thesis and Citation.

Going a stage further, an extension to this, requiring no extra work from the user, is for the Annotation properties field to contain a sub-field, which records the date at which the annotation was created (or last changed). This field would be filled in automatically by the system.
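The graded approaches above can be combined in one small sketch. It is illustrative only: the class and function names (Properties, add_value, KNOWN_VALUES) are hypothetical, and the value For-grant-proposal is invented to show the list being extended.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Pre-defined values the user can choose among; the first four come from the text.
KNOWN_VALUES = {"general", "For-thesis", "For-ACM-paper", "Citation"}


def add_value(name: str) -> None:
    """The simple mechanism for adding a new value to the pre-defined list."""
    KNOWN_VALUES.add(name)


@dataclass
class Properties:
    # Defaults to "general" if the user supplies nothing; may hold several values.
    values: frozenset = frozenset({"general"})
    # Sub-field filled in automatically by the system, not by the user.
    changed: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        unknown = set(self.values) - KNOWN_VALUES
        if unknown:
            raise ValueError(f"unknown property values: {sorted(unknown)}")


default = Properties()                                # least disciplined user
p = Properties(frozenset({"For-thesis", "Citation"}))  # more than one value
add_value("For-grant-proposal")                       # invented example value
q = Properties(frozenset({"For-grant-proposal"}))
```

Rejecting unknown values is a design choice that leans towards discipline; a gentler variant could simply add the new value to the list.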

Repository

We assume that, when a user creates a set of annotations for an underlying document, this set of annotations is stored as a document that is separate from the original document. We call this an annotation-document. The annotation-document will contain the ID of the underlying document and the set of annotations. (Alternatively each single annotation could be stored separately -- see later discussion.) The form of the ID will depend on the nature of the underlying document, but most commonly it will be a URL. Obviously it helps if the underlying documents are all of the same nature, e.g. all web pages, but if the aim is to cover all the documents the user reads, this simple uniformity will not hold.

(The above assumption of separating annotations from underlying documents is not an absolutely necessary one. As massive amounts of storage become increasingly cheap, it becomes possible, when one wants to record a set of annotations, to make a copy of the underlying document, embed the annotations in it, and treat the whole thing as the annotation-document. At least for textual documents, this may be a viable possibility, even when massive numbers of documents are involved.)

For one-off annotations, all that is needed is an annotation-document, but for lifelong annotations there is a need for a repository to bring all the annotations together. Otherwise they will be unmanageable. Repositories may be personal or may be shared among a group of people. Here we will assume they are personal.

Basic units in the repository

The basic unit in which the repository deals will probably be a single annotation. However we currently favour an approach where, as the user sees it, the set of annotations for a given document is stored together in a single document called the annotation-document. (In this model, committing an annotation-document to the repository does not cause a copy to be made, but instead causes links to be created: it is an implementation matter whether the annotation-document actually links to annotations held in the repository or vice versa. This has implications if the annotation-document is changed, especially if it is changed via a text editor rather than the normal annotation tools, which can keep track of what is happening.) The model is then that the user (1) loads a document; (2) makes a set of annotations; (3) commits this set of annotations, thus storing them in the repository/annotation-document. (For an editing-of-annotations run the sequence is (1) load a document, plus its previous set of annotations; (2) add/delete/amend annotations; (3) commit the new set of annotations, replacing the previous set.)
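The load, annotate, and commit sequence can be sketched as follows. This is a toy in-memory model under stated assumptions: the class and method names are hypothetical, annotations are reduced to strings, the URL is a placeholder, and the question of links versus copies is sidestepped.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationDocument:
    document_id: str  # most commonly a URL
    annotations: List[str] = field(default_factory=list)


class Repository:
    def __init__(self) -> None:
        self._by_document: Dict[str, AnnotationDocument] = {}

    def load(self, document_id: str) -> AnnotationDocument:
        """Step (1): load a document together with its previous annotations, if any."""
        previous = self._by_document.get(document_id)
        notes = list(previous.annotations) if previous else []
        return AnnotationDocument(document_id, notes)

    def commit(self, doc: AnnotationDocument) -> None:
        """Step (3): store the new set, replacing the previous set."""
        self._by_document[doc.document_id] = doc


url = "http://example.org/paper.html"  # placeholder ID
repo = Repository()

first = repo.load(url)                       # (1) load
first.annotations.append("Check this claim")  # (2) annotate
repo.commit(first)                            # (3) commit

edit = repo.load(url)                         # editing run: previous set comes back
edit.annotations.remove("Check this claim")
edit.annotations.append("Relevant to thesis")
repo.commit(edit)
```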

Uses of the repository

A typical use of the repository will be to perform some operation on all those annotations that meet certain conditions. A condition may, for example, be some combination of the following:

Sample operations to be performed may be find, delete, or perform-this-esoteric-task-designed-by-the-user. As this sample list implies, some common operations (e.g. find) will be built into the repository software; for other operations (e.g. the esoteric ones) the repository should have the capability of supplying a set of annotations, probably one by one, to a program supplied by the user. As one example, this program might be a conversion program that caused the annotations supplied from the repository to be deleted and replaced by annotations of the converted form. For instance the program might take all annotations of type Citation and convert them to annotations either of type Research-citation or Tutorial-citation. (We assume the program has some clever algorithm which looks at the underlying document for the citation and decides whether it is a research paper or a tutorial.)

As a second example the program might create a draft References section for a paper, using a given set of annotations of type Citation.
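Both examples can be sketched against a toy repository. Everything here is an assumption made for illustration: the Repository interface, the sample annotations, and above all the reclassify function, whose keyword test is only a placeholder for the clever algorithm mentioned above.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Annotation:
    type_name: str
    content: str
    document_id: str


class Repository:
    def __init__(self, annotations: Iterable[Annotation] = ()) -> None:
        self._items: List[Annotation] = list(annotations)

    def find(self, condition: Callable[[Annotation], bool]) -> List[Annotation]:
        """Built-in operation: return every annotation that meets the condition."""
        return [a for a in self._items if condition(a)]

    def convert(self, condition, program) -> None:
        """Supply matching annotations one by one to a user program and
        replace each with whatever the program returns."""
        self._items = [program(a) if condition(a) else a for a in self._items]


def is_citation(a: Annotation) -> bool:
    return a.type_name == "Citation"


def reclassify(a: Annotation) -> Annotation:
    # Placeholder for the "clever algorithm": a naive keyword test on the content.
    kind = "Tutorial-citation" if "tutorial" in a.content.lower() else "Research-citation"
    return replace(a, type_name=kind)


def draft_references(repo: Repository) -> str:
    """Second example: a numbered draft References section from Citation annotations."""
    cited = repo.find(is_citation)
    return "\n".join(f"[{i}] {a.content}" for i, a in enumerate(cited, start=1))


repo = Repository([  # invented sample data
    Annotation("Citation", "A tutorial on hypertext", "doc-1"),
    Annotation("Citation", "Context in information retrieval", "doc-2"),
    Annotation("For-thesis", "Useful counter-example", "doc-2"),
])
references = draft_references(repo)
repo.convert(is_citation, reclassify)
types = [a.type_name for a in repo.find(lambda a: True)]
```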

The user will also want the repository to help them with management of change, both within the repository itself and within the data in the repository (see the later section on catering for change). At the very least, given that most of us do something really silly at least once a month, it must be hard to accidentally delete the whole repository, or fill it with an infinite amount of garbage. Similar requirements apply, of course, to most existing databases and other repositories.

Committing to the repository

At the end of an annotation session the user will typically be asked if they want to commit their annotations to the repository. Some types of annotation may be essentially ephemeral, and committing them to the repository might eventually lead to a mountainous amount of junk being preserved. To combat this, one aspect of each annotation data type might be whether it was ephemeral or persistent.
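One way to picture this is a flag on the data type that the commit step consults. The sketch below is hypothetical throughout: the names AnnotationType and select_for_commit, and the ephemeral type Reminder, are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AnnotationType:
    name: str
    ephemeral: bool = False  # persistent unless stated otherwise


@dataclass
class Annotation:
    type: AnnotationType
    content: str


def select_for_commit(session: List[Annotation]) -> List[Annotation]:
    """Keep only annotations whose data type is persistent."""
    return [a for a in session if not a.type.ephemeral]


REMINDER = AnnotationType("Reminder", ephemeral=True)  # invented ephemeral type
CITATION = AnnotationType("Citation")

session = [
    Annotation(REMINDER, "Re-read section 3 tomorrow"),
    Annotation(CITATION, "Cites earlier work on context"),
]
to_commit = select_for_commit(session)
```

Putting the flag on the type rather than on each annotation means the user decides once per type, which keeps the burden at annotation time low.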

To mirror the commit facility there needs to be an uncommit facility. This raises a number of questions. Assuming the user uncommits a document which is a file, does uncommit search for all references to the file in the repository (which, depending on how the repository is organised, might take a long time), or does it just look at the annotations that are currently in the file (and are currently not ephemeral)? Does every commit, before it starts, perform an uncommit, in order to remove all annotations that relate to the previous content of the file? Does any change in the list of ephemeral contexts precipitate a full search of the repository to do the necessary commits and uncommits? Should there be periodic (nightly?) checks of the repository to ensure its consistency?

In a prototype repository that we built as an experiment, an annotation-document was stored in a UNIX file, and the filename (as a full UNIX path) was remembered in the repository. A problem with this approach is that filenames change -- even more frequently than the URLs of web pages. Hence the approach does not live easily with the epithet "lifelong".

When committing a document, ideally its whole environment should be recorded. Most importantly the data types it uses need to be preserved, as the user's data types will almost certainly evolve over time. If a document is a hyperdocument, ideally the documents it links to should also be preserved (especially if the document is simply a `banner' to enter some other site); in general, however, preserving all such information is obviously infeasible.

We assume each user has only one repository.

Catering for change

It is possible that an underlying document that the user has annotated may change in the future. The user can have two possible reactions:

The first case is a notorious problem for annotations. It arises either because the underlying document has moved, and thus has a new ID, or because its content has changed (the overwriting case mentioned above). When the content has changed, all the annotation anchors are likely to be wrong; various methods can be used to alleviate the effects of this, but deep down it is an insoluble problem. Users just have to live with it.

In general preserving information, together with its environment, for a lifetime brings tough legacy issues.

References
