Research issues in context-aware retrieval: the document collection
Peter Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk
ABSTRACT
Todo
Some basic assumptions
- We assume a document collection is a sequential file of stick-e notes.
- We assume that the Matcher can take as input one or more document collections.
(Since concatenating two document collections gives a document collection, the
multiple case can be handled outside the Matcher if necessary.)
The Matcher is not responsible for eliminating duplicates in cases where the same
document is loaded more than once.
- Creation and maintenance of the document collection is not the concern of the Matcher.
(With our existing tourism data, we generated the document collection automatically from
an external source, and this will probably often be the case.)
- The Matcher loads its document collection(s) and then works with a derived internal form stored
on its heap.
It is not responsible for monitoring any changes to the original document collection, so
such changes only take effect when the Matcher is restarted.
This assumption about restarting the Matcher is fine for most experimentation and for
practical use when, say, the collection is only updated once a day.
However, it may be useful to our research programme to investigate dynamic document collections;
we discuss below how this might be done.
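The assumptions above can be illustrated with a minimal sketch. It is not the actual Matcher:
the class, its constructor and the representation of a stick-e note as a small dictionary
with a "body" field are all hypothetical, chosen only to show the loading behaviour.

```python
from itertools import chain


class Matcher:
    """Hypothetical Matcher that loads its collections once into an internal form."""

    def __init__(self, *collections):
        # Concatenating collections gives a collection, so several inputs are
        # simply treated as one sequence of notes. No duplicates are removed.
        # Each note is copied, so later changes to the source are not seen.
        self._notes = [dict(note) for note in chain.from_iterable(collections)]

    def __len__(self):
        return len(self._notes)


# Two illustrative collections; note formats and contents are assumptions.
tourism = [{"body": "Cathedral"}, {"body": "Quay"}]
events = [{"body": "Cathedral"}]

# Passing two collections, or concatenating them outside the Matcher,
# loads the same notes, including the duplicate.
matcher = Matcher(tourism, events)
combined = Matcher(tourism + events)

# A change to the original collection is invisible to the running Matcher...
tourism.append({"body": "Museum"})

# ...and only takes effect when a new Matcher is started.
restarted = Matcher(tourism, events)
```

In this sketch the duplicate note is loaded twice, and the note appended after loading is
seen only by the restarted Matcher.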
Dynamic aspects
Given the above assumptions, the only practical way to deal with dynamic changes in the
document collection is for the user to feed them directly to the Matcher.
(The user will also almost always want to do parallel updates to the document collection files.)
Thus the Matcher needs to have a facility to load new documents dynamically, and to delete
existing documents.
Todo: unique IDs: if we have a single static document collection, it is reasonable
to require the user to place an ID on each document in the collection if IDs are needed.
In the dynamic case it is probably easier for the Matcher to keep a counter, which it
increases by one every time a document is loaded and whose current value is attached
as an extra field to each loaded document.
Think more about this.
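One possible shape for the counter scheme is sketched below. Everything in it is an
assumption made for illustration: the class name, the load and delete methods, the
"matcher_id" field name, and the choices to start the counter at 1 and never reuse an ID
after a deletion.

```python
class DynamicMatcher:
    """Hypothetical Matcher that accepts loads and deletions while running."""

    def __init__(self):
        self._next_id = 1   # counter, increased by one for every loaded document
        self._docs = {}     # internal form: Matcher-assigned ID -> stored note

    def load(self, note):
        """Load one note, attach the counter value as an extra field, return it."""
        doc_id = self._next_id
        self._next_id += 1
        stored = dict(note)              # copy, leaving the caller's note untouched
        stored["matcher_id"] = doc_id    # the extra field (name is an assumption)
        self._docs[doc_id] = stored
        return doc_id

    def delete(self, doc_id):
        """Delete a loaded note by ID; return False if no such note is loaded."""
        return self._docs.pop(doc_id, None) is not None

    def __len__(self):
        return len(self._docs)


dm = DynamicMatcher()
first = dm.load({"body": "Cathedral"})
second = dm.load({"body": "Quay"})
removed = dm.delete(first)
third = dm.load({"body": "Museum"})   # gets a fresh ID, not the deleted one
removed_again = dm.delete(first)      # the note is already gone
```

Because the counter only ever increases, an ID handed out once always refers to the same
document, even after that document has been deleted. The IDs are not stable across
restarts, however, which is one of the points still to be thought through.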
Relation to the history archive
If we have a history archive that takes the same format as a document collection,
and if we have an updating tool for maintaining the history archive, either
the Matcher or the user might find a use for this tool for maintaining document collections too.