Research issues in context-aware retrieval: the document collection

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

Todo

Some basic assumptions

  1. We assume a document collection is a sequential file of stick-e notes.
  2. We assume that the Matcher can take as input one or more document collections. (Since concatenating two document collections gives a document collection, the multiple case can be handled outside the Matcher if necessary.) The Matcher is not responsible for eliminating duplicates when the same document is loaded more than once.
  3. Creation and maintenance of the document collection is not the concern of the Matcher. (With our existing tourism data, we generated the document collection automatically from an external source, and this will probably often be the case.)
  4. The Matcher loads its document collection(s) and then works with a derived internal form stored on its heap. It is not responsible for monitoring changes to the original document collection, so such changes only take effect when the Matcher is re-started. This assumption about re-starting the Matcher is fine for most experimentation, and for practical use when, say, the collection is only updated once a day. However, it may be useful to our research programme to investigate dynamic document collections; we discuss below how this might be done.
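
Assumptions 2 and 4 can be illustrated with a minimal sketch. It is not the Matcher's actual loader: the function name load_collections is hypothetical, a stick-e note is treated as an opaque string, and the note texts are made up. The sketch shows only that loading several collections is equivalent to loading their concatenation, that duplicates are kept, and that the internal form is a snapshot unaffected by later changes to the source.

```python
from itertools import chain
from typing import Iterable, List

# Assumption: a stick-e note is an opaque string, and a document
# collection is any sequence of notes in file order.
Note = str
Collection = Iterable[Note]


def load_collections(*collections: Collection) -> List[Note]:
    """Build a derived internal form from one or more collections.

    Hypothetical helper: the result is a new list held in memory, so
    later changes to the source collections are not seen (assumption 4).
    Duplicates are deliberately not removed (assumption 2).
    """
    return list(chain.from_iterable(collections))


# Made-up example data.
tourism = ["note A", "note B"]
extra = ["note B", "note C"]

internal = load_collections(tourism, extra)

# Loading two collections equals loading their concatenation.
assert internal == load_collections(tourism + extra)

# The duplicate "note B" is kept.
assert internal.count("note B") == 2

# A later change to the source is not reflected until a reload.
tourism.append("note D")
assert "note D" not in internal
```

In this sketch a re-start of the Matcher corresponds to calling load_collections again.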

Dynamic aspects

Given the above assumptions, the only practical way to deal with dynamic changes in the document collection is for the user to feed them directly to the Matcher. (The user will almost always also want to make parallel updates to the document collection files.) Thus the Matcher needs a facility to load new documents dynamically and to delete existing documents.

Todo: unique IDs. If we have a single static document collection, it is reasonable to require the user to place some ID on each document in the collection if he needs one. In the dynamic case it is probably easier for the Matcher to keep a counter, which it increases by one every time a document is loaded and whose value is attached as an extra field to each loaded document. Think more about this.
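
One possible shape for the dynamic load and delete facility, with counter-based IDs, is sketched below. Everything named here is an assumption made for illustration: the class DocumentStore, the field name matcher_id, and the modelling of a note as a dict of named fields. The sketch also assumes that IDs are never reused after a deletion, which is a design choice the text above leaves open.

```python
from typing import Dict


class DocumentStore:
    """Hypothetical in-memory store for dynamically loaded documents."""

    def __init__(self) -> None:
        self._next_id = 1                  # counter, increased by one per load
        self._docs: Dict[int, dict] = {}   # ID -> loaded document

    def load(self, fields: dict) -> int:
        """Load one document and return the ID assigned to it."""
        doc_id = self._next_id
        self._next_id += 1
        # Attach the counter value as an extra field. The caller's dict
        # is copied, not modified.
        self._docs[doc_id] = {**fields, "matcher_id": doc_id}
        return doc_id

    def delete(self, doc_id: int) -> bool:
        """Delete a loaded document; return False if the ID is unknown."""
        return self._docs.pop(doc_id, None) is not None

    def get(self, doc_id: int) -> dict:
        return self._docs[doc_id]

    def __len__(self) -> int:
        return len(self._docs)


# Made-up example data.
store = DocumentStore()
first = store.load({"body": "first note"})
second = store.load({"body": "second note"})

store.delete(first)
third = store.load({"body": "third note"})

# IDs keep increasing; the ID of the deleted document is not reused.
assert (first, second, third) == (1, 2, 3)
assert store.get(third)["matcher_id"] == 3
```

Because the counter only ever increases, an ID obtained at load time remains a safe handle for a later delete, even if the same document text is loaded more than once.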

Relation to the history archive

If we have a history archive that takes the same format as a document collection, and if we have an updating tool for maintaining the history archive, then either the Matcher or the user might also use this tool to maintain document collections.