Lifelong annotation: a working paper

P.J. Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
e-mail: P.J.Brown@exeter.ac.uk

ABSTRACT

A valued working practice for a researcher is to annotate each paper document they read. Several of these documents, with their annotations, may be kept over the researcher's working life. Some of the annotations may prove valuable, many years after they were written, when the researcher refers to the document again; other annotations, of course, will be wasted in the sense that they will be created but never again looked at. This note looks at corresponding annotations produced electronically rather than on paper. It also looks at the infrastructure needed to support a body of on-line annotations, and to extract the maximum value from them.

One-off vs. lifelong annotation

The nature of annotation tasks varies over a spectrum whose end points are:

One-off annotations. A sample one-off task is annotating a draft document and returning the document to its author, so he can make corrections.
Lifetime annotations. Here the user might be a researcher who wants to annotate every document she reads that may be relevant to her research. The ultimate potential use of the annotations may be well-defined, e.g. a thesis that is to be written, or unknown ("I may need this in the future"). In the latter case, in particular, annotations might be made over the lifetime of the researcher, and used -- or selectively culled -- at various times over a long period.

Obviously one-off annotations can be rather more sophisticated than the simple example quoted. They may involve two or more people collaborating in writing annotations, they may be used over a reasonable period of time, and they may involve a set of documents rather than just one. Nevertheless, even with all these extensions, the scenario is a long way from the all-encompassing needs of lifelong annotations.

Here we focus on lifelong annotations, and the extra infrastructure needed to support them. This infrastructure must help exploit the advantages of electronic annotations over paper ones. In particular it must cover the management and retrieval of annotations, and must be robust enough to deal with massive numbers of annotations, and with change over time. All these are needs over and above the needs of one-off annotations. We are talking throughout about an electronic world where documents are read and annotated on-line. We will, however, start by relating this to the paper world.

Use of paper

Increasingly researchers obtain their documents on-line. Nevertheless if they want to read a long document they will usually print it and read the paper version.

Researchers have always been accustomed to annotating the paper documents they read. Annotation is hugely easier on paper than on-line. This is partly because on-line annotation systems are currently crude and limited in scope; nevertheless even when the state-of-the-art improves, such systems are unlikely to approach the convenience of creating annotations on paper. Moreover annotating on-line normally implies reading on-line too, rather than printing out documents on paper. On-line annotations, however, have one big advantage over paper annotations: they can be searched electronically. This advantage may outweigh the disadvantage of the extra effort needed to create annotations on-line, and generally of reading a paper on-line rather than on paper. Moreover on-line annotations may in future support extended uses of annotation, combining the traditional concept of annotation with facilities found in hypertext systems and web browsers, such as:

bookmarks.
history.
paths.

More ambitiously, annotations might provide a novel way of allowing a document to come in different versions: each version is captured by a different set of annotations, and the user selects the set(s) they wish to see. This works best if annotations can be presented as embedded in the underlying document, perhaps indistinguishable from the original when the user reads the document.

Infrastructure needs

Data types

A prime need, if the infrastructure is to be usable for tens of thousands, or even hundreds of thousands, of annotations, is for the user to be able to attach properties to each annotation they make. A computer science concept for doing this (and other things too) is that of data type. Here we will assume every annotation has a data type. Each user will have their own supply of data types; sometimes these will be default system-created ones, but more often they will be created by the user themselves to reflect their own interests and needs. Examples are:

For suggested edits to a colleague's paper: Deletion, Insertion, Change. Possibly Change might be replaced by two alternatives: Vital-change and Suggested-change. As a more general approach Change could be given an Importance field, holding, say, an integer between 1 and 5.
When the annotation relates to needs known in advance, there could be data types such as For-thesis, For-ACM-paper, or, to meet a rather more general need, Citation.
When the exact need is not known in advance, the data types might relate to the various `hats' the user wears, sub-jobs they do, or outside interests, e.g. Exams-officer, Prospectus, European-funding or Golf.
The above has little use if nearly all the annotations come into one category, e.g. the user's role as a researcher, and where the majority of the annotations are of the form: this-might-be useful-some-time. Data types relating to research sub-topics may then be useful, e.g. Mobile-systems, Document-mark-up. Data types like this might be presented as sub-types of the data types relating to the user's hats.
Todo. a few more?

During their career, users will continually be creating new data types -- and refining/generalising their existing data types -- particularly when starting a new project or interest.

Data types in programming languages can either represent single scalar values, such as integers or strings, or can be structures built up of a number of other data types, e.g. a string together with an integer. In more sophisticated systems these structures can contain sub-structures. In Object-Oriented technology, concepts of inheritance between different data types can be applied. All these mechanisms are likely to be useful in data types for annotations.
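As a minimal sketch of how structure and inheritance might apply to the suggested-edit data types mentioned earlier, the following uses Python dataclasses. The base type Edit and the field names note and importance are assumptions made for illustration; they are not part of any existing annotation system.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical base type for suggested edits to a colleague's paper."""
    note: str = ""


@dataclass
class Deletion(Edit):
    """Inherits everything from Edit; adds no fields of its own."""


@dataclass
class Insertion(Edit):
    """Inherits everything from Edit; adds no fields of its own."""


@dataclass
class Change(Edit):
    """A structure: the inherited string plus an integer Importance field."""
    importance: int = 3

    def __post_init__(self) -> None:
        # The Importance field is, say, an integer between 1 and 5.
        if not 1 <= self.importance <= 5:
            raise ValueError("importance must be between 1 and 5")


vital = Change(note="the proof is wrong", importance=5)
print(isinstance(vital, Edit), vital.importance)
```

A user who preferred the two alternatives Vital-change and Suggested-change could define them as two further sub-types instead of using the integer field.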

Assuming an annotation has a data type, this should capture the following three components:

some properties, capturing the type of annotation it is.
the anchor of the annotation: usually this will be a point or a fragment within the underlying document.
the content of the annotation: often this will simply be text. In many cases some meta-content, e.g. commentary about the reason for the content, may be interspersed with the content. For example, with an annotation that is a suggested change to a paper, the content of an Insertion annotation may have meta-content `because this has not been explained yet'. Thus generally the content, even if textual, might involve mark-up showing the different roles of various fragments of its text.

The data type to capture an annotation might therefore be a structure with three fields, one field for each of the above. The content field might be of type Multimedia. Each field may itself be structured into sub-fields. If annotations are stored singly -- see later discussion -- there may be further fields to specify the ID of the underlying document and a timestamp for the annotation. On the other hand, if annotations on a document are grouped together, these fields may be attached to the grouping.
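The three-field structure described above could be sketched as follows. This is an illustration under stated assumptions: the class and field names are hypothetical, the anchor is simplified to character offsets, and the content is plain text with a separate meta-content string rather than a full Multimedia type.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Anchor:
    """A point (start == end) or a fragment within the underlying document."""
    start: int
    end: int


@dataclass
class Content:
    """The annotation's content, with optional meta-content about it."""
    text: str
    meta: str = ""


@dataclass
class Annotation:
    properties: list[str]   # capture the type of annotation it is
    anchor: Anchor
    content: Content
    # Only needed when annotations are stored singly; if annotations on a
    # document are grouped, these would be attached to the grouping instead.
    document_id: Optional[str] = None
    timestamp: Optional[datetime] = None


insertion = Annotation(
    properties=["Insertion"],
    anchor=Anchor(start=120, end=120),
    content=Content(
        text="An anchor is the place an annotation is attached to.",
        meta="because this has not been explained yet",
    ),
)
print(insertion.properties, insertion.anchor, insertion.content.meta)
```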

Data types may also be attached to the sub-parts of an annotation, rather than to a whole annotation. Furthermore, data types can be regarded as a mechanism for supplying metadata. In the rest of this paper we will use the concept of metadata, since it is more general.

Relation to context-based retrieval

The main purpose of associating metadata with annotations is to aid subsequent retrieval. (This applies especially in personal annotation systems, since such metadata concepts as "supporting argument" and "counter-argument" among a group do not apply.) Thus concepts found in context-based information retrieval are highly relevant since the aim is the same: adding something to make retrieval easier. The context may be regarded as metadata that can be attached to an IR query and/or to all or part of the documents to be retrieved. Context may be physical, such as location, or may relate to the user's activity. In relation to the latter Toms talks of domain (health, travel, etc.), user's task, and genre. Järvelin and Ingwersen talk more generally of "dimensions" of context. Sometimes contextual elements may be derived automatically, e.g. physical location, and sometimes they may be supplied explicitly by the user. For further details see the proceedings of the SIGIR workshop from which several of the references below are taken.

All of the above is relevant to metadata for annotations. In our work so far we have not investigated the automatic supplying of context, apart from the timestamps we have attached to annotation files. Our thinking on the user's context has centred on projects or jobs (hats), rather than on short-term tasks or topic domain. A project may, of course, cross several domains. This approach is based on hunch: we have no evidence yet that it is better or worse than the others. In our implementation we have given users a great degree of freedom to create data types as they wish. Evidence from Kelly shows considerable variation in what users call a task or project: in her experiments one user said they divided their time (over a semester) between 6 tasks, whereas another user said they divided among 35.

Discipline

A mantra in computer science is that if an artefact is really large, the creators will need to follow some discipline in creating it. If there is no discipline, either complexity or the effects of a combinatorial explosion will make the artefact unusable and unmaintainable.

Generally humans hate discipline. For example software tools, e.g. certain CASE tools, that force people to work to a certain discipline tend to be unpopular with programmers and require coercion from management. In the case of an annotation, if the future use, if any, of that annotation is unknown at the time of creation, the user will be reluctant to provide a lot of details about it.

Nevertheless a creator of a body of lifelong annotations will benefit from discipline. The discipline will make it easier to classify annotations and to retrieve the right ones when they are needed. This discipline may involve, at the time an annotation is created, recording a lot of properties to capture its nature. Many users, however, would not accept a strong discipline, and, assuming there is no coercion, would shun any annotation system that imposes it.

Thus designers of an annotation infrastructure need to tread carefully.

As regards data types -- which are one aspect of discipline -- the simplest approach is to have a single data type, Annotation, whose properties field consists of a single string. The user types the value of the string, which describes the nature of the annotation. If the user is reluctant even to do this, a default value "general" could be supplied.

A slightly more disciplined approach is for the user to choose among a set of pre-defined values for the properties field, e.g. For-thesis, For-ACM-paper, Citation, etc., as we used in a previous example. There should be a simple mechanism for the user to add new values to the list of pre-defined values.

A more general approach is to allow a single annotation to have more than one of these pre-defined property values, e.g. to be both For-thesis and Citation.

Going a stage further, an extension to this, requiring no extra work from the user, is for the Annotation properties field to contain a sub-field, which records the date at which the annotation was created (or last changed). This field would be filled in automatically by the system.
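The graded levels of discipline above can be sketched in a few lines. The class name PropertyVocabulary and the dictionary keys labels and created are hypothetical choices for illustration only; the sketch simply combines the default value "general", a user-extensible list of pre-defined values, multiple values per annotation, and a date filled in automatically.

```python
from datetime import datetime, timezone


class PropertyVocabulary:
    """A user-extensible list of pre-defined property values."""

    def __init__(self, values):
        self.values = set(values)

    def add(self, value):
        # The simple mechanism for adding a new pre-defined value.
        self.values.add(value)

    def make_properties(self, chosen=()):
        unknown = set(chosen) - self.values
        if unknown:
            raise ValueError(f"not pre-defined: {sorted(unknown)}")
        # A reluctant user who chooses nothing gets the default "general".
        labels = sorted(chosen) or ["general"]
        # The date sub-field needs no extra work from the user.
        return {"labels": labels, "created": datetime.now(timezone.utc)}


vocab = PropertyVocabulary(["For-thesis", "For-ACM-paper", "Citation"])
both = vocab.make_properties(["For-thesis", "Citation"])
lazy = vocab.make_properties()
print(both["labels"], lazy["labels"])
```

Rejecting unknown values is one possible design choice; a more permissive system might instead offer to add the new value to the list.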

Repository

We assume that, when a user creates a set of annotations for an underlying document, this set of annotations is stored as a document that is separate from the original document. We call this an annotation-document. The annotation-document will contain the ID of the underlying document and the set of annotations. (Alternatively each single annotation could be stored separately -- see later discussion.) The form of the ID will depend on the nature of the underlying document, but most commonly it will be a URL. Obviously it helps if the underlying documents are all of the same nature, e.g. all web pages, but if the aim is to cover all the documents the user reads, this simple uniformity will not hold.

(The above assumption of separating annotations from underlying documents is not an absolutely necessary one. As massive amounts of storage become increasingly cheap, it becomes possible, when one wants to record a set of annotations, to make a copy of the underlying document, embed the annotations in it, and treat the whole thing as the annotation-document. At least for textual documents, this may be a viable possibility, even when massive numbers of documents are involved.)

For one-off annotations, all that is needed is an annotation-document, but for lifelong annotations there is a need for a repository to bring all the annotations together. Otherwise they will be unmanageable. Repositories may be personal or may be shared among a group of people. Here we will assume they are personal.

Basic units in the repository

The basic unit in which the repository deals will probably be a single annotation. However we currently favour an approach where, as the user sees it, a set of annotations for a certain document are stored together in a single document called the annotation-document. (We currently favour a model where committing an annotation-document to the repository does not cause a copy to be made, but instead causes links to be created: it is an implementation matter whether the annotation-document actually links to annotations held in the repository or vice versa. This has implications if the annotation-document is changed, especially if it is changed via a text editor, rather than the normal annotation tools -- which can keep track of what is happening.) The model is then that the user (1) loads a document; (2) makes a set of annotations; (3) commits this set of annotations, thus storing them in the repository/annotation-document. (For an editing-of-annotations run the sequence is (1) load a document, plus its previous set of annotations; (2) add/delete/amend annotations; (3) commit the new set of annotations, replacing the previous set.)

Uses of the repository

A typical use of the repository will be to perform some operation on all those annotations that meet certain conditions. A condition may, for example, be some combination of the following:

of data type X.
with properties Y.
whose content field contains the string Z.
that was created between times T1 and T2.
that has ID1 as its underlying document.

Sample operations to be performed may be find, delete, or perform-this-esoteric-task-designed-by-the-user. As this sample list implies, some common operations (e.g. find) will be built into the repository software; for other operations (e.g. the esoteric ones) the repository should have the capability of supplying a set of annotations, probably one by one, to a program supplied by the user. As one example, this program might be a conversion program that caused the annotations supplied from the repository to be deleted and replaced by annotations of the converted form. For instance the program might take all annotations of type Citation and convert them to annotations either of type Research-citation or Tutorial-citation. (We assume the program has some clever algorithm which looks at the underlying document for the citation and decides whether it is a research paper or a tutorial.)
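A minimal sketch of such a repository interface follows, assuming the repository is simply an in-memory list. The names select and convert_citations, the field names, and the example URLs are all hypothetical. The classify function is only a stand-in for the clever algorithm assumed above: it inspects the annotation content rather than the underlying document.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass
class Annotation:
    data_type: str
    properties: set
    content: str
    created: datetime
    document_id: str


def select(repo, data_type=None, properties=None, contains=None,
           created_from=None, created_to=None, document_id=None):
    """Supply, one by one, the annotations meeting all the given conditions."""
    for a in repo:
        if data_type is not None and a.data_type != data_type:
            continue
        if properties is not None and not set(properties) <= a.properties:
            continue
        if contains is not None and contains not in a.content:
            continue
        if created_from is not None and a.created < created_from:
            continue
        if created_to is not None and a.created > created_to:
            continue
        if document_id is not None and a.document_id != document_id:
            continue
        yield a


def convert_citations(repo, classify):
    """User-supplied operation: delete each Citation, add the converted form."""
    for old in list(select(repo, data_type="Citation")):
        repo.remove(old)
        repo.append(replace(old, data_type=classify(old)))


def classify(a):
    # Placeholder heuristic, not the clever algorithm the text assumes.
    return "Tutorial-citation" if "tutorial" in a.content.lower() else "Research-citation"


repo = [
    Annotation("Citation", {"For-thesis"}, "A tutorial on mark-up",
               datetime(2004, 3, 1), "http://example.org/a"),
    Annotation("Citation", set(), "New results on mobile systems",
               datetime(2004, 6, 1), "http://example.org/b"),
    Annotation("Insertion", set(), "explain this term",
               datetime(2004, 7, 1), "http://example.org/a"),
]
convert_citations(repo, classify)
print(sorted(a.data_type for a in repo))
```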

As a second example the program might create a draft References section for a paper, using a given set of annotations of type Citation.

The user will also want the repository to help him with management of change, both within the repository itself and within the data in the repository (see subsequent Section). At the very least, given that most of us do something really silly at least once a month, it must be hard to accidentally delete the whole repository, or fill it with an infinite amount of garbage. Similar requirements apply, of course, to most existing databases and other repositories.

Committing to the repository

At the end of an annotation session the user will typically be asked if they want to commit their annotations to the repository. Some types of annotation may be essentially ephemeral, and committing them to the repository might eventually lead to a mountainous amount of junk being preserved. To combat this, one aspect of each annotation data type might be whether it was ephemeral or persistent.
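One way the ephemeral/persistent distinction might work at commit time is sketched below. The ephemeral flag on the data type, the commit function, and the example type name Reading-position are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    name: str
    ephemeral: bool = False  # persistent unless stated otherwise


@dataclass
class Annotation:
    data_type: DataType
    content: str


def commit(session, repository):
    """Store only persistent annotations; return how many were stored."""
    kept = [a for a in session if not a.data_type.ephemeral]
    repository.extend(kept)
    return len(kept)


citation = DataType("Citation")
position = DataType("Reading-position", ephemeral=True)  # hypothetical type
session = [
    Annotation(citation, "cite this in the thesis"),
    Annotation(position, "stopped reading here"),
]
repository = []
print(commit(session, repository))
```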

To mirror the commit facility there needs to be an uncommit facility. This raises a number of questions. Assuming the user uncommits a document which is a file, does uncommit search for all references to the file in the repository (which, depending on how the repository is organised, might take a long time), or does it just look at the annotations that are currently in the file (and are currently not ephemeral)? Does every commit, before it starts, perform an uncommit, in order to remove all annotations that related to the previous content of the file? Does any change in the list of ephemeral contexts precipitate a full search of the repository to do the necessary commits and uncommits? Should there be periodic (nightly?) checks of the repository to ensure its consistency?

In a prototype experiment in which we created a repository, an annotation-document was stored in a UNIX file, and the filename (as a full UNIX path) was remembered in the repository. A problem with this approach is that filenames change -- even more frequently than the URLs of web pages. Hence the approach does not live easily with the epithet "lifelong".

When committing a document, ideally its whole environment should be recorded. Most importantly the data types it uses need to be preserved, as the user's data types will almost certainly evolve over time. If a document is a hyperdocument, ideally the documents it links to should also be preserved (especially if the document is simply a `banner' to enter some other site); in general, however, preserving all such information is obviously infeasible.

We assume each user has only one repository.

Catering for change

It is possible that an underlying document that the user has annotated may change in the future. The user can have two possible reactions:

he wants his previous annotations to be applied to the changed version of the document. For example the user might have annotated a document representing a table of protein names. When a new version of the document is issued (which might have new table entries added -- no problem -- or existing table entries changed -- a potential problem), the user wants the previous annotations to be converted (as far as practicable) to the new document. In such cases the new document has often overwritten the old one, and has the same ID; its creation date can reveal whether it is a new document.
he wants his annotations to apply to the original version of the document. In this case, if there is a possibility of future change, a copy can be made of the original document at the time the annotation is made. Once the copy has been made, change is no longer an issue.

The first case is a notorious problem for annotations. It arises either because the underlying document has moved, and thus has a new ID, or because its content has changed (the overwriting case mentioned above). When the content has changed, all the annotation anchors are likely to be wrong; various methods can be used to alleviate the effects of this, but deep down it is an insoluble problem. Users just have to live with it.
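As an illustration of one alleviating method (not a solution), the sketch below assumes an anchor is a pair of character offsets and that the old version of the document is still available. It looks for the anchored text in the new version and picks the occurrence nearest the old position. The function name reanchor and the sample table entries are hypothetical.

```python
from typing import Optional


def reanchor(old_text: str, new_text: str,
             start: int, end: int) -> Optional[tuple[int, int]]:
    """Find the anchored fragment in the new version; None if it has gone."""
    fragment = old_text[start:end]
    if not fragment:
        return None  # a bare point carries no text to search for
    positions = []
    pos = new_text.find(fragment)
    while pos != -1:
        positions.append(pos)
        pos = new_text.find(fragment, pos + 1)
    if not positions:
        return None  # the anchored text itself was changed
    best = min(positions, key=lambda p: abs(p - start))
    return best, best + len(fragment)


old = "ABC1 kinase\nXYZ2 ligase\n"
start = old.index("XYZ2")
added = "AAA0 protease\nABC1 kinase\nXYZ2 ligase\n"    # entry added: no problem
changed = "ABC1 kinase\nXYZ9 ligase\n"                 # entry changed: a problem
print(reanchor(old, added, start, start + 4))
print(reanchor(old, changed, start, start + 4))
```

In the second case the function returns None, which reflects the point made above: when the anchored content itself has changed, the annotation cannot be placed automatically and the user has to intervene.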

In general preserving information, together with its environment, for a lifetime brings tough legacy issues.

References

Järvelin, K. and Ingwersen, P. (2004). `Extending information retrieval and retrieval research toward context', Proc. ACM SIGIR 2004 Workshop on Information Retrieval.
Kelly, D. (2004). `Building a test collection for investigating contextual information retrieval', Proc. ACM SIGIR 2004 Workshop on Information Retrieval.
Toms, E.G., Bartlett, J., Freund, L., Dufour, C. and Szigeti, S. (2004). `Identifying the significant contextual factors of search', Proc. ACM SIGIR 2004 Workshop on Information Retrieval.
