Research issues in context-aware retrieval: databases versus IR

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

The world of databases and the world of information retrieval are unfortunately almost disjoint. When implementing a context-aware retrieval (CAR) system we need to choose which world to live in. If the underlying data is structured, a database is certainly a possibility, and many CAR systems use one. There may, however, be important areas where databases do not meet CAR needs.

Issues

We list here some issues that might prevent us from using a database:

  1. We want a ranked list of matches rather than the Boolean matching that databases offer, i.e. we want each match to be given a matching-score.
  2. We normally have information that is divided into fields (which is, of course, excellent for databases), but we may want to give some fields more weight than others. Overall we want the match of each field to be given a score, and then an algorithm that combines these field scores, weighting each appropriately, into an overall matching-score. We wish to have control over the design of this algorithm, in order to research different possibilities (see the combination sketch after this list).
  3. We may want to design our own algorithms for scoring how well two field values match. For example, if we have two areas, each represented by a rectangle, our algorithm might use (a) the inverse of the square of the distance between the centres of the two rectangles, and/or (b) the extent to which the rectangles overlap (see the rectangle-matching sketch after this list). As another example, we may wish to have sophisticated algorithms for matching such things as text fields, images, etc.
  4. We may want to support both optional and compulsory matching of fields (e.g. location must match, temperature may match); the combination sketch below illustrates this. (?Can databases/SQL do this?)
  5. In reality locations often represent areas, e.g. Devon, Exeter Cathedral Close. Although our current approach is to use rectangles, we may wish to move on to polygons, etc. (?GIS systems may be the answer here?)
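Issue 3 is easiest to see in code. The listing below is a minimal sketch, in Python, of the two rectangle-matching heuristics mentioned there; the Rect class, the 1/(1+d^2) fall-off and the choice of the smaller rectangle as the normaliser are assumptions made for illustration, not part of any existing system.

    from dataclasses import dataclass


    @dataclass
    class Rect:
        """An axis-aligned rectangle (x1, y1)..(x2, y2), with x1 <= x2 and y1 <= y2."""
        x1: float
        y1: float
        x2: float
        y2: float

        def centre(self):
            return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

        def area(self):
            return (self.x2 - self.x1) * (self.y2 - self.y1)


    def centre_distance_score(a: Rect, b: Rect) -> float:
        """Heuristic (a): score in (0, 1], falling off with the square of the
        distance between the two centres (1/(1+d^2) avoids division by zero)."""
        (ax, ay), (bx, by) = a.centre(), b.centre()
        d_squared = (ax - bx) ** 2 + (ay - by) ** 2
        return 1.0 / (1.0 + d_squared)


    def overlap_score(a: Rect, b: Rect) -> float:
        """Heuristic (b): the overlap area as a fraction of the smaller
        rectangle's area, giving a score in [0, 1]."""
        width = min(a.x2, b.x2) - max(a.x1, b.x1)
        height = min(a.y2, b.y2) - max(a.y1, b.y1)
        if width <= 0 or height <= 0:
            return 0.0
        return (width * height) / min(a.area(), b.area())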
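Issues 1, 2 and 4 together amount to a small scoring framework. The sketch below shows one hypothetical way of arranging it (the field specification, the weights and the simple weighted sum are illustrative choices, not a description of our engine or of any product): each field has a matcher and a weight, compulsory fields must score above zero, and records are returned as a ranked list.

    from typing import Callable, Dict, List, Tuple

    # field name -> (weight, compulsory?, matcher(query_value, record_value) -> score in [0, 1])
    FieldSpec = Dict[str, Tuple[float, bool, Callable[[object, object], float]]]


    def overall_score(query: dict, record: dict, spec: FieldSpec) -> float:
        """Weighted combination of per-field scores; 0.0 if a compulsory field fails."""
        total = 0.0
        weight_sum = 0.0
        for field, (weight, compulsory, matcher) in spec.items():
            score = matcher(query[field], record[field])
            if compulsory and score == 0.0:
                return 0.0              # e.g. "location must match"
            total += weight * score
            weight_sum += weight
        return total / weight_sum if weight_sum else 0.0


    def ranked_matches(query: dict, records: List[dict],
                       spec: FieldSpec) -> List[Tuple[float, dict]]:
        """A ranked list of (matching-score, record), best first (issue 1)."""
        scored = [(overall_score(query, record, spec), record) for record in records]
        scored = [(s, r) for s, r in scored if s > 0.0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored

A spec that gives the location field a high weight with compulsory set to True, and the temperature field a lower weight with compulsory set to False, captures the "location must match, temperature may match" example of issue 4; the rectangle functions above could serve as the location matcher.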

Fuzzy matching

One way of adapting databases to meet some of our needs is to have a fuzzy-matching front-end. A few commercial products support fuzzy matching as a front-end to any standard SQL database. My guess is that they work as follows. Assume the requirement is to match Location L and Temperature T. The fuzzy-matcher first tries to find exact matches and, if it finds any, gives them top matching-scores. It then tries "near-miss" matches, e.g. a temperature between T-1 and T+1, then a temperature between T-3 and T+3. Each is given a matching-score according to how near the miss is. The scores of the individual fields are accumulated (and weighted, if desired) to give an overall matching-score for a database record.
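To make this guess concrete, here is a small sketch using Python's sqlite3 module against a hypothetical readings(id, temperature, ...) table; the band widths and scores are invented for illustration and say nothing about how any real product behaves.

    import sqlite3

    # (band half-width, score awarded for a hit in that band), narrowest band first
    TEMPERATURE_BANDS = [(0, 1.0), (1, 0.6), (3, 0.3)]


    def fuzzy_temperature_scores(conn: sqlite3.Connection, t: float) -> dict:
        """Return {record id: score} for records whose temperature is near t,
        keeping the score of the narrowest band each record falls into."""
        scores = {}
        for half_width, score in TEMPERATURE_BANDS:
            rows = conn.execute(
                "SELECT id FROM readings WHERE temperature BETWEEN ? AND ?",
                (t - half_width, t + half_width),
            )
            for (record_id,) in rows:
                scores.setdefault(record_id, score)
        # Per-field scores like these would then be weighted and accumulated with
        # the scores of other fields (location, etc.) to rank whole records.
        return scores

A logarithmic scale of the kind Rhodes used would correspond to choosing the band scores logarithmically rather than the arbitrary values shown here.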

This approach has been used by Rhodes [2], using the Fuzzy-Matcher product marketed by Sonalysts Inc. (?but do they still support it?). Rhodes used a logarithmic scale. Fuzzy queries are also discussed in a more general paper by Teska [1].

Conclusions

We are researchers, and want to investigate many possibilities, unconstrained as far as possible by the tools we use. For this reason, I think our current approach of implementing our own matching engine, and living outside the world of databases, is the best one.

Some relevant papers

1. K. Teska, 'Fuzzy Logic Query Processing'.
2. B.J. Rhodes, 'The Wearable Remembrance Agent: A system for augmented memory', Personal Technologies, 1.