Peter Brown and Gareth Jones
Department of Computer Science, University of Exeter, Exeter EX4 4QF,
UK
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk
Todo
We suggest certain experiments that might be performed; as a notational convention, we introduce these with italics, e.g. `suggested experiment:'.
Our research has thrown up a host of speculative ideas for improving the performance of CAR, and to progress we must have a realistic way of evaluating the success or otherwise of the implementation of these ideas. Unfortunately CAR evaluation is hard. This is for (at least) four reasons.
Gareth: Actually I'd go as far as to say that our research is to a large extent actually defining what CAR is, from which point we are trying to find methods to do it well. Which suggests to me that we need to understand what we regard as the baseline in some sense, so that we can see how our ideas improve things. I return to this point later on, but I'm starting to think that we have a complex situation combining issues in absolute retrieval accuracy (as measured by precision and recall), with system issues (such as how long something takes to do, how much data needs to flow around the system, etc. - caching and the push/pull arguments fit in here), and user issues (where the context-of-interest is a key factor). In order to address these interrelated, but rather separate issues we need to take the evaluation strategy apart carefully.
The first reason is the dissimilarity from a laboratory environment. IR is traditionally evaluated by means of laboratory experiments. Robertson [2] highlights the deficiencies of such experiments:
".. there are many issues in IR system design that seem to fall [between] the conditions obtaining in a laboratory and those real-life operational conditions under which systems are actually used. Perhaps the most glaring example relates to the study of highly interactive systems and of the process of interaction, or even more broadly in the study of users and the task contexts in which information seeking takes place. Designing an experiment to answer useful research questions involves balancing the need for laboratory-style control against the need for operational realism."
These difficulties are particularly strong for CAR. Unlike for traditional IR, the CAR user is typically mobile, and the static conditions of a laboratory experiment do not remotely apply. Thus to evaluate CAR we really must do some real experimentation in the field, though there are still some aspects that can be evaluated by laboratory-style experiments. Unfortunately real field testing is hugely slower and more expensive than laboratory testing. It is therefore simply unrealistic to perform a field study every time you have a new idea that needs evaluation, e.g. a new matching algorithm or a tweak to an existing one; instead we must devise some laboratory-style experiments, even if crude, to be used on a day-to-day basis, and only move on to field trials when we have a reasonably solid system to investigate.
Gareth: I think we need to distinguish carefully between the physical mobility of the user (and how they physically interact with the retrieval system) and the physical mobility of the user (and how this mobility relates to change of context and consequent change of query and the related relevant documents). The issue of how users interact physically with a context-aware retrieval system is related to traditional interactive search (although evaluation will almost certainly be more difficult). The issue of how the query and relevance set changes due to mobility can, I think at this point, be related strongly to the traditional query and relevance set model. I suggest that at this point we concentrate only on the query/relevance set side of things. That is, try to start with a scenario that is as like the traditional non-interactive laboratory experiment as possible. Peter: I agree with this, I think, but the first sentence is not clear: it distinguishes between the same thing twice. One example that highlights the different aspects of mobility would be a static user retrieving information related to moving objects, e.g. a vehicle fleet co-ordinator. Apr 4: Gareth: I was trying to observe that the user does not have to be physically mobile using a CAR system in order to perform quantitative CAR experiments. The user interacting with a CAR system in the field is a different experimental environment. The example of the moving object is an interesting one; I think that this essentially corresponds to a continuously changing document set, and we should probably consider this case as a context-aware filtering task: the user and their profile (including context) are not changing, but the document (the car) and its features (in this case its context) are changing. This is a nice example of how prediction of movement might enable us to minimise the number of filtering actions for this changing document.
The second reason CAR evaluation is hard is as follows. CAR is typically continuous, and indeed our research has concentrated on exploiting this. Many of our methods involve an analysis of past change (the context diary) and forecasting the future. Traditional IR evaluation is based on single stand-alone queries. For CAR any reference set would need to take account of the history of context change; it is uncertain whether a reference set can be created at all, partly because the expertise does not exist to create it.
Gareth: Clearly in practice the query is changing (as we have noted elsewhere) almost continuously. How to deal with this from a practical retrieval perspective (making sure that the best set of relevant documents is available, caching, etc.) and from an HCI perspective (e.g. how often to update the screen, the context-of-interest) has to some extent also been examined by Rhodes.
Gareth: We need to avoid being confused by the distinction between (a) what the relevant documents are for a particular query and their ranking to the user and (b) the efficiency/speed of delivery. Essentially tools such as the context-diary are devices which seek to improve precision and possibly speed of delivery. They do not of themselves change what the actual set of relevant documents is for an instantaneous context/query. Thus I think that the parallels with traditional IR are stronger than you anticipate. For example, having established what the relevant document set is, we can perform a baseline run which makes no use of the context-diary and then subsequent runs with various permutations of context-diary and perhaps context use. A useful parallel with traditional IR evaluation might be to think in terms of establishing baseline performance and then incorporating relevance feedback in the retrieval model in an attempt to improve the ranking of relevant documents.
Gareth: So I think that we should define a query and find out what the relevant documents are. Further NEW queries can be defined based on change of context (e.g time, location) for which we also need to find (manually) the relevance set. The problem here is that gathering relevance data is expensive. Hopefully we can make some use of the often incremental change in the relevance set as the context changes incrementally. Peter: I have just re-read Rhodes and Maes; they just measure precision, and do so by user feed-back; there is nothing on recall and no relevance sets; maybe our initial experiments should do this too. Apr 4: Gareth: I'll take a look at this again myself, it would be nice if possible to have some sort of formal relevance set since this will enable us to get some sort of quantitative understanding of the effects of the use of the diary: it's very hard to draw solid conclusions about algorithms from user studies - presumably every time you change an algorithm or a parameter you need to perform another equivalent user study.
The third problem is that CAR typically involves a large number of contextual fields. Evaluation must take place in a highly multi-dimensional space.
Gareth: This needs careful thought. If all the context fields are key to describing the information need, then all of them need to be taken into account in deciding the relevance set. I suggest that at least to begin with we should deal only with a very limited number of fields, so that we can think about the retrieval models carefully in the evaluation. I don't think that we want to get bogged down in issues related to a high number of different continuous sensor values at this point. Peter: I agree; it would be good to have more than one contextual field, but I see no reason to go beyond two or three in our experiments. Apr 4: Gareth: Yes, I would agree with 2 or 3 as a good starting point.
The fourth problem is that CAR systems that are used in the field are often tightly focussed. They might be tied to one application, one document collection, one style of query (perhaps automatically generated and just using location). There is a danger that an evaluation of such a focussed system may have no value beyond the system itself. We discuss this issue further below.
We need to find ways of surmounting these difficulties. This note tries to make a modest start in defining some useful basic experiments.
Gareth: I think that it would be useful to distinguish between different field types in the query. You seem to refer to the entire query as the "context": search words and context-sensor values (raw values, possibly abstracted to a higher level in the query). Is this correct? Peter: Yes: perhaps I should have been more careful in separating the current context from the query derived from the current context.
Some principles governing our approach to CAR evaluation are as follows:
Gareth: Recall is important for cache assessment. Gathering full relevance data for a query is a problem (Robertson may discuss pooling in his paper), but hopefully we can make some reasonable stab at this. Peter: Recall and caches are discussed below; I think measuring recall, or rather recall degradation, for a cache is a much easier problem than the general one; general strategy: concentrate on measuring precision, except with caching, where recall is obviously crucial?
Gareth: I'm not aware of work in the use of caching for retrieval. I'm aware of work on approximations in the retrieval algorithms to increase speed (so the answers may not be the best possible), but I can't quote references on this off hand.
A lot of published work on evaluation is closely tied to individual systems, and of limited general use. However, Rhodes and Maes [1] provide an exception. Their evaluation related to a number of systems, and had a general outlook. They performed two sets of experiments: the first was a traditional user field trial (actually field trials are easier for them, because their system is designed for use on static terminals, i.e. the field trial is closer to laboratory conditions). The second was an exercise in relevance feedback: users gave a score out of 5 to each document that was delivered. This was used to calculate an overall average value for precision. They go on to explain that a document can get a top score for precision, but still get a zero score for usefulness! This is because the document may be already known to the user; indeed in some of the experiments the documents had actually been written by the user to whom they were delivered. As a result Rhodes and Maes added features to their systems to cut out documents that were relevant but not useful. The result of this could even be that the measured precision was worse, but the overall usefulness was better -- a telling support to Robertson's caveats about the value of laboratory experiments. Rhodes and Maes do not report any evaluations concerned with recall: this would be harder to evaluate (is this a general truth?). Gareth: 14 April: Yes. We could filter previously seen documents from the output, which would have the effect of improving the precision of the unseen items - although since relevant old items are being ignored overall precision would still be lower; but since the old files can be removed automatically perhaps we should.
We have mentioned the danger that results of an application might be application-specific. One way of combatting this is to calculate some metrics that categorise particular applications, and then to relate the results to the metrics. The metrics might be static ones relating to the document collection, or dynamic ones relating to change or to pattern of usage.
Consider for example two metrics: (a) the rate of change of (some aspect of) the current context, and (b) the average time between successive retrieval requests. Consider also an experiment that finds, for a certain application, the optimal time to look ahead when calculating the context-of-interest. This optimum might be application-specific, but if it were related to (a) and (b) it would be more generally useful. With luck, for example, further experiments might show that the optimum was, say, linearly related to metric (b), assuming (a) was kept constant. The experiments would then have much bigger import.
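To make this concrete, here is a minimal sketch (in Python; the context-diary and request-log formats are invented purely for illustration) of how metrics (a) and (b) might be computed from a logged session, so that experimental results such as an optimal look-ahead time can be reported against them.

```python
# Sketch only: assumes a context diary logged as (timestamp_seconds, location_x, location_y)
# tuples and a separate list of retrieval-request timestamps. The diary format is an
# assumption for illustration, not part of any existing system.

from math import hypot

def rate_of_context_change(diary):
    """Metric (a): average speed of location change over the session (units per second)."""
    total_distance = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(diary, diary[1:]):
        total_distance += hypot(x1 - x0, y1 - y0)
    elapsed = diary[-1][0] - diary[0][0]
    return total_distance / elapsed if elapsed > 0 else 0.0

def mean_time_between_requests(request_times):
    """Metric (b): average time between successive retrieval requests (seconds)."""
    gaps = [t1 - t0 for t0, t1 in zip(request_times, request_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

if __name__ == "__main__":
    diary = [(0, 0.0, 0.0), (60, 0.3, 0.4), (120, 0.6, 0.8)]   # hypothetical session
    requests = [10, 70, 130, 200]
    print(rate_of_context_change(diary), mean_time_between_requests(requests))
```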
Metrics are, of course, much easier for inherently numeric quantities. With location-based applications, some important metrics might be: (1) the range of locations covered by the document collection; (2) their average density (e.g. document sites per square mile); (3) the range of locations covered by a typical user during a session; (4) the way the score given by the matching algorithm for two locations decays as the locations move apart, e.g. the scores at one mile and ten miles (clearly if decay is slow, an inaccurate prediction for location will still give reasonable results, but if decay is fast the prediction needs to be spot-on). (This issue requires thought: my hunch is that it does not matter too much if you only have one contextual field: you might use the N best matches however low their scores were. It does, however, matter a lot in multi-field cases: if the matching algorithms for different fields have radically different decay rates, and not all data has every field, there would be a lack of "fairness" between fields unless the difference in decay rates was allowed for.)
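As an illustration of metric (4), the following sketch assumes one possible form of location-matching score with an explicit decay parameter; comparing the scores at one mile and ten miles then characterises how forgiving the algorithm is of an inaccurate location prediction. The exponential form is an assumption, not a claim about any particular matching algorithm.

```python
# Sketch of a location-matching score with an explicit decay rate. The exponential
# form and the half_distance parameter are assumptions for illustration only.

def location_score(distance_miles, half_distance_miles):
    """Score in (0, 1]: 1.0 at zero distance, halving every half_distance_miles."""
    return 0.5 ** (distance_miles / half_distance_miles)

if __name__ == "__main__":
    for half in (0.5, 5.0):                    # fast decay vs. slow decay
        print(half, location_score(1.0, half), location_score(10.0, half))
```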
Assume we want to compare two sets of retrieval results, where each result is a set of documents, each with an associated score. It is adequate to represent each document by a unique ID, since, for comparisons, we do not generally want to look at the content of the retrieved documents, but only at the degree to which the two results contain the same set of documents with the same scores. Typically one result could be a benchmark against which the other is measured. For example the benchmark might relate to the user's real context and the other result might relate to the user's predicted (and possibly wrong) context. As a second example, the benchmark may relate to a default matching algorithm and the other result may relate to a new, hopefully improved matching algorithm. In all cases it is useful if the evaluation yields some number (positive or negative) to measure closeness to the benchmark. We call this the closeness-score.
Generally the overall evaluation will involve comparing a large number of pairs of results, and drawing some overall conclusion from the values of the closeness-scores, e.g. from their average. Gareth: 14th April: We could look at the change in precision at a fixed rank; we could also compare average precision values, look at overlap and the number of relevant items retrieved.
In general a retrieval result will consist of the N highest-scoring documents, with the proviso that all scores must exceed some threshold T. Two extreme cases of this are (a) setting T to 0, so that we just get the N best-scoring documents, or (b) setting N to infinity, so that we get all the documents that beat the threshold. Although (a) has been used for some evaluation experiments, we believe it is not apposite to CAR: one of our tenets is that if no documents beat the threshold you deliver nothing; thus an evaluation that includes documents below the threshold does not relate to reality. Nevertheless (b) is not completely satisfactory either: the value of T is arbitrary, and indeed in a real application the user might continually change the value of T, according to whether they feel that too few or too many documents were being delivered. Gareth: 14th April: Look at use of feedback for automatically adjusting the threshold.
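The notion of a retrieval result used here can be pinned down in a few lines; the sketch below (an assumed representation, not an existing interface) returns the N best-scoring documents that also beat the threshold T, with the two extreme cases obtained by setting T to 0 or by making N effectively unbounded.

```python
# Sketch: a retrieval result as the N highest-scoring documents whose scores exceed
# threshold T. Documents are represented only by IDs, scores by floats in [0, 1].

def retrieval_result(scored_docs, n, threshold):
    """scored_docs: dict mapping document ID -> score. Returns list of (id, score)."""
    above = [(doc, s) for doc, s in scored_docs.items() if s > threshold]
    above.sort(key=lambda pair: pair[1], reverse=True)
    return above[:n]

if __name__ == "__main__":
    scores = {"A": 0.98, "B": 0.85, "C": 0.76, "D": 0.40}
    print(retrieval_result(scores, n=10, threshold=0.75))   # case (b): all above T (N effectively unbounded)
    print(retrieval_result(scores, n=2, threshold=0.0))     # case (a): best N only
```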
An example of a pair of results is as follows. We assume the threshold T is 0.75. Benchmark:
Document A with score .98
Document B with score .85
Document C with score .76
Second result:
Document D with score .92
Document B with score .77
One simple approach -- a Boolean one in the sense that it ignores scores -- is to measure what proportion of the benchmark documents are in the second result; in the above example this is 1/3. A tweak could be to incorporate a negative factor if the second result contains documents not in the benchmark (like Document D in the above example). Gareth: 14th: I don't think one can set an absolute threshold in advance: there needs to be some element of "intelligence" and operational learning.
A more precise approach is to look at the way the scores of the documents in the benchmark have changed. With the data presented in the above example, we cannot do this, as the scores for Documents A and C in the second result are not recorded. To get around this we can set T as 0 when retrieving the second result: we then get all the document scores. This is the approach we are initially inclined to favour. Thus our approach is to set some (albeit rather arbitrary) values of T and N for the benchmark, and then measure how much the second result has changed the scores of those documents retrieved in the benchmark. Thus in our example we might find that Document A got a score of .52 and Document C a score of .74. Our average change of score, which we use as the closeness measure, is now ((.98 - .52) + (.85 - .77) + (.76 - .74)) / 3. The simplest approach is to ignore all documents that are not in the benchmark. If, however, we wanted to refine the experiment we could introduce negative factors for documents that were wrongly given high scores, like Document D in our example. Gareth: Apr 14: There are recall issues associated with this. We need to know if the documents are relevant. It would be useful to know what relevant material we have missed. It could be useful to find relevant material just below the existing cutoff for learning - but is there any way to do this?
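Both measures just described -- the Boolean proportion of benchmark documents recovered, and the average drop in their scores when the second run is made with T set to 0 -- can be expressed directly; the sketch below reproduces the worked example, and omits any negative factor for spurious documents such as Document D, as in the simplest approach.

```python
# Sketch of the two closeness measures discussed above. 'benchmark' holds the
# documents retrieved at threshold T; 'second_full' holds scores from the second run
# made with T = 0, so every benchmark document has a score available.

def boolean_closeness(benchmark, second_result_ids):
    """Proportion of benchmark documents that also appear in the second result."""
    hits = sum(1 for doc in benchmark if doc in second_result_ids)
    return hits / len(benchmark)

def average_score_change(benchmark, second_scores):
    """Mean drop in score of benchmark documents in the second (T = 0) run."""
    changes = [benchmark[doc] - second_scores[doc] for doc in benchmark]
    return sum(changes) / len(changes)

if __name__ == "__main__":
    benchmark = {"A": 0.98, "B": 0.85, "C": 0.76}          # scores above T = 0.75
    second_at_T = {"D", "B"}                               # second result at T = 0.75
    second_full = {"A": 0.52, "B": 0.77, "C": 0.74, "D": 0.92}  # second run with T = 0
    print(boolean_closeness(benchmark, second_at_T))       # 1/3
    print(average_score_change(benchmark, second_full))    # ((.98-.52)+(.85-.77)+(.76-.74))/3
```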
Gareth: I think that this is where we need to start.
One of our research aims is to design and evaluate new algorithms for matching and scoring, e.g. an algorithm for scoring the match between two locations, or an algorithm for aggregating the scores of individual fields into an overall total. A simple experiment is as follows: take a fixed document collection D, and design a varied set of sample current contexts. For each current context, get a set of `experts' (this would probably need to be ourselves to start with) to define what documents from D should be delivered. The experiment can then test closeness to the experts' choice, the reference set. This experiment has the severe limitation of considering each context in isolation, whereas a lot of our work aims to exploit the history of how context is changing. However the experiment would make a start, and one could argue that exploiting history is covered by experiments concerned with the context-of-interest.
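A minimal sketch of scoring one candidate algorithm against such an experts' reference set, for a single sample context, using plain set-based precision and recall (the representation by document IDs is an assumption for illustration):

```python
# Sketch: compare a matching algorithm's delivered set against the experts'
# reference set for one sample context. IDs only; the representation is assumed.

def precision_recall(delivered, reference):
    delivered, reference = set(delivered), set(reference)
    relevant_retrieved = len(delivered & reference)
    precision = relevant_retrieved / len(delivered) if delivered else 0.0
    recall = relevant_retrieved / len(reference) if reference else 0.0
    return precision, recall

if __name__ == "__main__":
    print(precision_recall(delivered=["A", "B", "D"], reference=["A", "B", "C"]))
```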
Gareth: The assumption of independence of each search from previous context and searches is, I think, essentially the same as the one made in standard IR evaluation. For example, it is acknowledged that people's knowledge changes over time as they gain information from documents or their context changes, and that this could potentially be taken into account in the search process. The issue of personalisation of search engines (part of the topic of Michelle Fisher's research, for example) seeks to address this, and evaluation is extremely difficult for researchers in this area too.
Gareth: I think that your suggestion is about as good as we can do to start with. There seems to be one big assumption, which I think we'll have to live with (though if we are not happy we can think some more): that the relevance set in the current context is independent of the past or the future. If we are happy with this we could (I haven't thought this idea through at the moment) simulate different diary scenarios and see how incorporating these in various ways affects retrieval behaviour. Also we could gather multiple relevance sets for incremental changes in context and see if chaining these together (as in a user making occasional reference as the retrieved list is revised based on changed context) improved precision.
The idea of the context-of-interest is that, if the retrieval system pretends the user's context is a bit ahead of the true context, it is more likely to retrieve documents the user wants. One approach is to base the context-of-interest on time, and take the context-of-interest N seconds ahead. We assume the context-of-interest is set by some form of linear extrapolation (e.g. location would depend on the user's rate and direction of travel). One question is: how far ahead should the context-of-interest be, i.e. what is a good value for N?
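A sketch of the simplest case, linear extrapolation of the location component of context N seconds ahead (the coordinate representation and constant-velocity assumption are for illustration only):

```python
# Sketch: context-of-interest by linear extrapolation of location, N seconds ahead.
# The (x, y) coordinate representation and constant-velocity assumption are for
# illustration only.

def context_of_interest(x, y, velocity_x, velocity_y, n_seconds):
    """Predicted location N seconds ahead under constant velocity."""
    return x + velocity_x * n_seconds, y + velocity_y * n_seconds

if __name__ == "__main__":
    # User at (0, 0) moving east at 0.01 units/second; look 120 seconds ahead.
    print(context_of_interest(0.0, 0.0, 0.01, 0.0, 120))
```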
Gareth: The above seems a reasonable starting point for evaluation development. For the last point, won't this depend very much on the application under consideration, so that it's hard to generalise? Peter: This point made me think; indeed there is a big danger in getting results that are so application-specific as to be almost useless. To this end I added a section about metrics above.
Clearly the answer depends on (a) user preferences on how good a score they would give to information delivered in advance, and (b) the predictability of change.
Gareth: (a) is an important point. If context-of-interest delivery was working really well, the user probably wouldn't think about it too much, since the system would really just be delivering what they wanted to know WHEN they wanted to know it, even though the real relevance is to sometime in the future.
An evaluation to find good values for N in relation to (a) can only be done via field experiments. In terms of aspect (b) it is clearly bad if predictions are usually wrong. I suggest that initially we try to separate this aspect from (a). We could get data for this by tracking real users in the field (could we get data from Lancaster, who have recorded a set of user logs?). A possible experiment is as follows. We set some threshold for prediction, e.g. an average of 80% of predictions must be accurate. One way of defining accurate is that the prediction is better than nothing at all (i.e. better than assuming the context does not change and thus assuming the next context will be the same as the present one). With this definition, a prediction is accurate if it turns out to be closer to the user's context in N seconds time than the current context is. The result of the experiment would be an upper limit on N, maxN, that cannot be exceeded if 80% accuracy is to be preserved. Getting an optimal value of N for case (a) might then be constrained by the limit that it should not be greater than maxN.
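Given a logged track of a real user, the suggested experiment might be sketched as follows; the track format, the linear predictor and the 80% figure are all assumptions in line with the description above.

```python
# Sketch of the maxN experiment: for each candidate look-ahead, count how often a
# linear prediction of the user's location that far ahead is closer to the truth
# than simply assuming no change. Track format (t, x, y at a fixed sampling
# interval) is an assumption for illustration.

from math import hypot

def prediction_accuracy(track, n_steps):
    """Fraction of points where linear extrapolation n_steps ahead beats 'no change'."""
    wins = trials = 0
    for i in range(1, len(track) - n_steps):
        (t0, x0, y0), (t1, x1, y1) = track[i - 1], track[i]
        dt = t1 - t0
        horizon = n_steps * dt
        # Linear prediction n_steps samples ahead from point i.
        px, py = x1 + (x1 - x0) / dt * horizon, y1 + (y1 - y0) / dt * horizon
        tx, ty = track[i + n_steps][1], track[i + n_steps][2]    # true future location
        if hypot(px - tx, py - ty) < hypot(x1 - tx, y1 - ty):
            wins += 1
        trials += 1
    return wins / trials if trials else 0.0

def max_n(track, candidate_steps, required_accuracy=0.8):
    """Largest look-ahead (in samples) still meeting the required accuracy."""
    best = 0
    for n in candidate_steps:
        if prediction_accuracy(track, n) >= required_accuracy:
            best = n
    return best

if __name__ == "__main__":
    # Hypothetical straight-line track sampled every 10 seconds.
    track = [(10 * i, 0.05 * i, 0.0) for i in range(200)]
    print(max_n(track, candidate_steps=[1, 3, 6, 12, 30]))
```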
Gareth: This again seems sensible. One way to implement this might be to say: what is relevant in this context? For which we could use the data set developed for evaluation of Matching Algorithms. Then simulate different contexts by moving the user physically or temporally, saying this is the context now, it is changing in this manner, what do we predict the relevance set will be at the known point in the future? Obviously if the prediction is completely right we get 100% of the ideal, and we can then simulate the effects of mistakes. I just spotted a problem with this, and I think your idea as well: if the prediction is wrong, we don't know what the true relevance set is for the actual outcome after N seconds. Peter: I do not understand this, particularly the last two sentences; we can surely simulate both correct prediction (e.g. based on past real data) and incorrect prediction? Apr 4: Gareth: Maybe I didn't explain this very well. I'm still not sure on this. If we predict that the user will move to point a1 and we know the relevance data for a1, we know that the relevance set at the current point a0 should be the a1 relevance set (because we say that the context-of-interest is a1), but if the user instead goes to a2 (for which we don't know the relevance set), how can we assess the quality of the context-of-interest prediction? Some of the a1 documents may be relevant at a2, but which ones? We could assess relevance for multiple values ai, but this could be very time expensive and we would need to work carefully to avoid user learning and adverse assessment sequencing effects. Apr 4: Peter: I think this is a different (and interesting) experiment. One approach is to take the documents retrieved for the context a0 as the norm, and those retrieved for a2 as the ideal. If we then look at the documents retrieved for a1, can we measure if these are closer to the ideal (prediction right) than the norm (no prediction) is? The experiment could be based on the best three documents, say: are the three best for a1 closer to the three best for a2 than the three best for a0 are? For more sophistication we could take account of scores. Gareth: 14th: Yes, good starting point. This gives a lower bound, but not an upper bound on performance I think. Use scores and the full list for experiments - we can easily cut the length of the list to simulate real life and see what the effects are.
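The comparison of the top three documents at a0, a1 and a2 suggested above might be sketched as follows, using simple overlap with the a2 list as the closeness measure (scores could be incorporated later); all document lists are invented for illustration.

```python
# Sketch: does the predicted context a1 retrieve a top-k list closer to the
# ideal (a2, where the user actually went) than the no-prediction baseline (a0)?
# Closeness here is simple overlap of the top-k lists.

def overlap(list_a, list_b):
    """Number of documents the two top-k lists have in common."""
    return len(set(list_a) & set(list_b))

def prediction_helped(top_a0, top_a1, top_a2):
    """True if the a1 (predicted) list is closer to the a2 (ideal) list than a0 is."""
    return overlap(top_a1, top_a2) > overlap(top_a0, top_a2)

if __name__ == "__main__":
    top_a0 = ["A", "B", "C"]       # retrieved at the current context
    top_a1 = ["B", "E", "F"]       # retrieved at the predicted context
    top_a2 = ["E", "F", "G"]       # retrieved at the context actually reached
    print(prediction_helped(top_a0, top_a1, top_a2))   # True: prediction helped here
```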
Although we have assumed that prediction will be by linear extrapolation, given data on user movements one might be able to hone the algorithms: e.g. are there some heuristics to show when unpredictable change is likely (e.g. when there is sudden heavy rain!)? If such cases can be detected the value of maxN might be temporarily reduced during danger periods. This might be the subject of later, more sophisticated, experiments.
Gareth: This aspect would seem to me to be hugely application dependent, particularly in terms of the context type being monitored. Peter: I agree; all this is only likely to be a long-term focus.
The value of N is also, I believe, likely to depend on the document collection. If we imagine one document collection that relates to cathedrals, these may, perhaps, be an average of 50 miles apart. If the collection related to eating places, the average might be a few hundred yards apart. Setting N so that the context-of-interest was one mile ahead would make almost no difference for the cathedrals, but would make a huge difference for the eating places -- indeed it would be a poor strategy as eating places near to the user would be missed (todo questions of already seen). If one purpose of the context-of-interest is to tell the user what is coming, we should base N on the sparseness or otherwise of the document collection. We could certainly do some experiments to evaluate this aspect: e.g. for a fixed document collection and for different values of N, compare the highest ranked (or three highest ranked, etc.) document delivered with what would be delivered at the true current context (the case where N is 0). We could then get a feel for what value of N made any difference, e.g. delivered a different highest ranked document 20% of the time. Clearly the value of N depends on the range covered by the document collection, and how densely the range is covered.
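A sketch of this experiment: for a fixed collection and a given look-ahead N, measure how often the top-ranked document differs from what would be delivered at the true current context. The retrieval function and the toy one-dimensional collection below are assumptions for illustration only.

```python
# Sketch: for a fixed document collection, how often does a look-ahead of N change
# the top-ranked document compared with N = 0? 'retrieve_top' is an assumed callable
# that takes a context and returns the highest-ranked document ID for that context.

def fraction_top_changed(contexts, advance_context, retrieve_top, n_seconds):
    """Proportion of sampled contexts where the look-ahead changes the top document."""
    changed = 0
    for context in contexts:
        ahead = advance_context(context, n_seconds)   # e.g. linear extrapolation
        if retrieve_top(ahead) != retrieve_top(context):
            changed += 1
    return changed / len(contexts)

if __name__ == "__main__":
    # Toy usage with a hypothetical 1-D location context and two document "sites":
    # a sparse site (cathedral) far away and a dense one (cafe) nearby.
    sites = {"cathedral": 50.0, "cafe": 0.3}
    retrieve_top = lambda x: min(sites, key=lambda d: abs(sites[d] - x))
    advance = lambda x, n: x + 0.01 * n                 # 0.01 units per second
    print(fraction_top_changed([i * 0.1 for i in range(20)], advance, retrieve_top, 60))
```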
Gareth: I agree completely with the analysis of the problems. The issues of true vs predicted context relevance sets are still true here I think.
The key issue with using a cache is recall, not precision, so I suggest we concentrate on this. A relatively easy quantity to measure, and one that is useful, is recall degradation, i.e. what proportion of relevant documents are lost by using the cache rather than the original document collection.
Any evaluation needs to be related to the aims of the cache, e.g. (a) to make retrieval faster, or (b) to cover disconnected operation. Within (b) there may be a further constraint that the cache be small enough to fit in, say, a PDA.
In [3] we have considered ways of building context-aware caches. Our suggested approach is to choose some key parameters for the cache, e.g. that it should be useful for M minutes or cover a range of locations RL, and make a forecast of the union of contexts that the user will enter within these key parameters. We call the latter the context-union. We then make a retrieval, using the context-union as the current context, and select for the cache those documents whose score exceeds some threshold T. Subjects for choice and evaluation are:
Gareth: We need to find some good caching algorithms to try.
Gareth: I'm not sure that R is a very useful way to go. The size of the available memory on a device will be independent of the size of the document collection. I don't see that varying cache size depending on the size of the document collection is very meaningful. I would have thought that you would aim for the largest available cache size and go for this. The update strategy would depend on the caching algorithm, but could also depend on the monitored variation of context. If things are changing quickly you would want to update more often than if they are changing slowly. In terms of an experiment you want to see the effect of your caching strategy on recall and to a lesser extent precision. Peter: I do not agree; an application designer needs to make a decision on what proportion of free store to use for caches and how much for other things, and some experimental results giving values for R might be a useful guide. In any case I do not think this is a big issue: it is just a matter of how results are presented. Apr 4: Gareth: Exploring the effects of different cache sizes is clearly important, but one would need to be careful with R. What is wrong with a large cache for a small collection, or a small cache for a large collection? Also, you need to make sure that the cache is realistic: for a very large collection there is little point in having an R value requiring a cache size of a terabyte, at least in the near future.
Another issue, important when storage is tight or when transmission of items to a cache is slow and/or expensive, is the wastage in the cache, i.e. the proportion of items in the cache that are never actually used during the lifetime of the cache. This wastage should be easy to measure.
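Both cache measures -- recall degradation relative to the full collection, and wastage within the cache -- reduce to simple proportions; the sketch below assumes the cache contents, the relevance judgements and the usage log are available as sets of document IDs.

```python
# Sketch of the two cache measures discussed above. All sets contain document IDs;
# the representations are assumptions for illustration.

def recall_degradation(relevant_in_collection, cache_contents):
    """Proportion of relevant documents lost by retrieving from the cache only."""
    relevant = set(relevant_in_collection)
    lost = relevant - set(cache_contents)
    return len(lost) / len(relevant) if relevant else 0.0

def wastage(cache_contents, documents_used):
    """Proportion of cached documents never actually used during the cache lifetime."""
    cache = set(cache_contents)
    unused = cache - set(documents_used)
    return len(unused) / len(cache) if cache else 0.0

if __name__ == "__main__":
    print(recall_degradation({"A", "B", "C", "D"}, {"A", "C", "X", "Y"}))  # 0.5
    print(wastage({"A", "C", "X", "Y"}, {"A"}))                            # 0.75
```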
Results of tests performed so far:
Todo