Peter Brown and Gareth Jones
Department of Computer Science, University of Exeter, Exeter EX4 4QF,
UK
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk
This experiment measures recall degradation when the user's path of locations strays outside the area covered by a cache.
The experiment consisted of a number of individual tests. Each test had a starting point, and the cache for the test was built using this starting point, i.e. it was the centre of the cache-building square. The starting points were chosen by hand, some being in the middle of popular towns and cities and some being 'in the middle of nowhere'. The choice was based on a hunch about where a tourist aid would be most used, not on any deep analysis.
In each test the user was assumed to proceed steadily in a straight-line
path, performing a retrieval at regular retrieval points, a fixed
distance apart.
There were 11 retrieval points: the first was at the starting point of the test;
the sixth was on the very edge of the square used to build the cache, and the
remainder were increasingly further outside this area.
(In fact the paths were diagonal: they ran from the starting point to the
NE corner of the square, and then an equal distance beyond.)
The straight line paths were chosen so that the user remained in
the overall area covered by the document collection, i.e. the South West of England (and did not stray into
the sea!).
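The geometry of a test path can be sketched as follows. This is only an illustration, not the code used in the experiment: the 20 km square, the 11 retrieval points and the (2000, 2000) metre step are taken from these notes, while the function names and the origin used as a starting point are assumptions.

```python
import math

SQUARE_SIDE_M = 20_000      # side of the cache-building square (from the notes)
STEP_M = (2_000, 2_000)     # (east, north) step between retrieval points
NUM_POINTS = 11             # retrieval points per test


def retrieval_points(start_e, start_n):
    """Return the 11 (easting, northing) points on the north-east diagonal."""
    return [(start_e + i * STEP_M[0], start_n + i * STEP_M[1])
            for i in range(NUM_POINTS)]


def position(point, start):
    """Classify a point as 'inside', 'edge' or 'outside' the square."""
    half = SQUARE_SIDE_M / 2
    offset = max(abs(point[0] - start[0]), abs(point[1] - start[1]))
    if offset < half:
        return "inside"
    return "edge" if offset == half else "outside"


start = (0, 0)  # hypothetical origin; the real tests used grid references
points = retrieval_points(*start)
labels = [position(p, start) for p in points]

# Distance of the last retrieval point beyond the NE corner of the square
corner = (start[0] + SQUARE_SIDE_M / 2, start[1] + SQUARE_SIDE_M / 2)
overshoot_km = math.dist(points[-1], corner) / 1000
```

With these values the first five points fall inside the square, the sixth lies on its NE corner, and the last point is a little over 14 kilometres beyond that corner.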
The following picture shows how the user's path proceeds.
The retrieval at each of the 11 retrieval points used the cache; the results of this retrieval were compared
with what would have been retrieved if the whole document collection had
been used rather than the cache.
Each retrieval retrieved all the documents whose score was greater than a given
threshold (a document's score would not, of course, be affected by
whether it came from the cache or from the original
document collection).
A count was made of the total number of documents "lost" over the
11 retrieval points (i.e. documents retrieved from the original
document collection but not in the cache).
We call these lost retrievals.
A count was also made of the total number
of documents retrieved from the cache.
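A minimal sketch of this counting, assuming each retrieval point yields a mapping from document id to score against the full collection, and that the cache is simply a set of document ids. The names `retrieve` and `count_lost` and the sample scores are made up for illustration.

```python
def retrieve(scores, threshold):
    """Return ids of documents whose score is greater than the threshold."""
    return {doc_id for doc_id, score in scores.items() if score > threshold}


def count_lost(scores_by_point, cache_ids, threshold):
    """Total benchmark retrievals and lost retrievals over all retrieval points.

    scores_by_point: one dict per retrieval point, mapping doc id -> score
    against the full collection. cache_ids: ids of the documents in the cache.
    """
    benchmark = lost = 0
    for scores in scores_by_point:
        wanted = retrieve(scores, threshold)
        benchmark += len(wanted)
        # Lost: retrieved from the full collection but absent from the cache
        lost += len(wanted - cache_ids)
    return benchmark, lost


# Made-up scores for two retrieval points (illustrative only)
scores_by_point = [
    {"a": 0.99, "b": 0.97, "c": 0.40},
    {"b": 0.98, "c": 0.96, "d": 0.95},
]
cache_ids = {"a", "b"}
totals = count_lost(scores_by_point, cache_ids, threshold=0.93)
```

Because a document's score does not depend on where it is stored, the set retrieved from the cache is just the benchmark set with the lost documents removed.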
Each test was repeated -- we call this a sequence of sub-tests --
with different thresholds (the same threshold was used for all 11 retrieval points).
The first of the 11 retrieval points is at the starting point and, provided its
retrieval threshold is not lower than the one used to build the cache,
it will have no failures -- i.e. everything that would have been retrieved
from the original document collection is in the cache.
The last of the 11 retrieval points is well outside the cache-building square, and this point
is likely to have the most failures.
Of the remaining points, the nearer they are to the start the fewer failures are likely.
In the tests the best results are likely to be where there are plenty of
attractions inside the cache-building square and many fewer just outside.
Thus tests that started at the centre of cities had good results.
Not surprisingly the worst results were from the test on Dartmoor, which
started in a wild area with few attractions; its cache was therefore small,
and its rate of failure turned out to be even worse than one would expect
from the proportional size of the cache.
Tourists do not flock to the Somerset Levels yet surprisingly
the test centred here (at the small town of Somerton) had the
worst results, i.e. the most sites missed by the cache.
Proportionally, it was especially poor compared with the other cases at high retrieval thresholds (e.g. at a 99% threshold, Plymouth had 26 hits and no losses, whereas Somerton had
6 hits and 2 losses).
The direction of progress from the starting point took the user ever closer to
more important tourist areas such as Wells, Bath and Glastonbury.
The number of retrievals of course increased as successively lower thresholds were used, but
the proportion of lost retrievals increased quite sharply.
This was surprising, though some increase would be expected.
(To take an illustrative example: if you are at the last retrieval point,
about 14 kilometres outside the cache-building square, you might, with a low threshold, get sites
20 kilometres away, and these might be 34 kilometres outside the cache-building square;
such far-away sites are unlikely to be in the cache, even though the cache
is built with a low threshold.)
Todo: more refined conclusions.
We have performed some preliminary tests of context-aware caching.
In order to reduce the number of variables we concentrated on one
contextual field, location.
The other contextual fields were kept constant during the experiments, and were not
active in matching.
Our document collection consisted of information about tourist sites, each of which had an associated location.
We chose a set of different starting points, all well within
the area covered by the document collection, and each representing
a possible place a tourist might start wanting information.
For each starting point,
we built a context-aware cache that encompasses sites whose location matched a square centred on
the starting point.
We call this square the cache-building square.
The cache-building square had sides 20 kilometres long.
We set a fairly generous threshold of 50% for inclusion in the cache,
and as a result about one fifteenth of the documents in our collection
went into the cache.
(Many of these were outside the cache-building square, since locations outside,
but close to, the square would still get a good score.)
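Cache building of this kind might be sketched as below. The 20 km square and the 50% threshold come from the text; the scoring function is purely an assumption (full score inside the square, falling linearly to zero over a hypothetical 20 km outside it), and the site names and coordinates are made up.

```python
SQUARE_SIDE_M = 20_000       # cache-building square side (from the text)
CACHE_THRESHOLD = 0.5        # the 50% inclusion threshold (from the text)
DECAY_RANGE_M = 20_000       # hypothetical: score reaches 0 this far outside


def distance_outside_square(site, centre):
    """Largest axis distance by which a site lies outside the square (0 if inside)."""
    half = SQUARE_SIDE_M / 2
    de = max(abs(site[0] - centre[0]) - half, 0)
    dn = max(abs(site[1] - centre[1]) - half, 0)
    return max(de, dn)


def score(site, centre):
    """Hypothetical score: 1.0 inside the square, falling linearly outside it."""
    return max(0.0, 1.0 - distance_outside_square(site, centre) / DECAY_RANGE_M)


def build_cache(sites, centre):
    """Keep every site whose score is greater than the inclusion threshold."""
    return {name for name, loc in sites.items()
            if score(loc, centre) > CACHE_THRESHOLD}


sites = {                      # made-up sites (easting, northing in metres)
    "in_square": (3_000, -4_000),
    "just_outside": (14_000, 0),
    "far_away": (45_000, 0),
}
cache = build_cache(sites, centre=(0, 0))
```

Under these assumptions the site 4 km outside the square still scores 0.8 and so enters the cache, which is the effect described in the parenthesis above.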
We assumed the cache was being used during disconnected operation, and thus
there was no way of updating it.
Our tests of the cache were quite demanding: we assumed that the user
went in a straight-line path, starting at the centre of the cache-building square,
proceeding to the edge of the square, and continuing until an equal distance
outside.
We assumed the user made retrievals at 11 points equally spaced along the way.
(Thus the first five points were inside the square, the next one was on the edge,
and the remaining five were increasingly far outside.)
We counted the total number of documents retrieved at the 11 points.
We then repeated each experiment using the original document collection
rather than the cache, and again counted the number of documents retrieved.
The difference between the two numbers represented the number of potential
retrievals lost because of the use of the cache, i.e. the lost retrievals.
We accumulated these numbers for all our starting points.
We repeated these experiments using different threshold scores (e.g. 99%, 98%, 95%, ...)
for retrieval.
The number of lost retrievals increased dramatically as the threshold
decreased.
We expected some increase (a lax threshold would allow the retrieval of documents associated with
locations a long way from the cache-building square) but were surprised at its magnitude.
As a final step we tried different algorithms for matching two locations:
one algorithm decayed linearly according to the distance apart of the locations, and the other
decayed as N squared.
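The notes do not give the formulas for the two matching algorithms, so the following is only a sketch of what the two decay shapes could look like. Both functions, the 50 km range and the 5 km scale are assumptions; in particular, the inverse-square form is just one possible reading of "decayed as N squared".

```python
def linear_score(distance_km, range_km=50.0):
    """Assumed linear decay: 1.0 at distance 0, reaching 0.0 at range_km."""
    return max(0.0, 1.0 - distance_km / range_km)


def inverse_square_score(distance_km, scale_km=5.0):
    """Assumed square-law decay: score falls with the square of the distance."""
    return 1.0 / (1.0 + (distance_km / scale_km) ** 2)


# Compare the two shapes at a few distances (in kilometres)
comparison = [(d, linear_score(d), inverse_square_score(d))
              for d in (0, 5, 10, 20)]
```

A score that falls this quickly with distance would need much lower retrieval thresholds to return a comparable number of documents, which is at least consistent with the 50% to 5% thresholds used for the N-squared algorithm in the appendix.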
Some preliminary conclusions are:
Commentary on results
Possible text for paper
APPENDIX A: results of individual tests using default algorithm and cache of size 39
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 63).
Point furthest from centre of cache (25300000, 07300000) is 2330563, 0847571; distance away is: 23.1709 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
99% | 5 | 2 |
98% | 16 | 4 |
96% | 51 | 16 |
93% | 258 | 93 |
Default matching algorithm used was: (version pjb May 29 2002) .
Commentary on this starting point: the cache is centred on a wild moorland area.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 78).
Point furthest from centre of cache (29500000, 09500000) is 3096091, 0852581; distance away is: 17.5841 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
99% | 28 | 1 |
98% | 49 | 7 |
96% | 159 | 20 |
93% | 465 | 148 |
Default matching algorithm used was: (version pjb May 29 2002) .
Commentary on this starting point: this may be a favourable case as the cache is centred on an area with lots of attractions, and the area outside the cache has fewer attractions.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 44).
Point furthest from centre of cache (25000000, 05600000) is 2377095, 0801320; distance away is: 27.0573 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
99% | 25 | 0 |
98% | 49 | 0 |
96% | 152 | 13 |
93% | 376 | 107 |
Default matching algorithm used was: (version pjb May 29 2002) .
Commentary on this starting point: this may be a favourable case as the cache is centred on an area with lots of attractions, and the area outside the cache has fewer attractions; the cache may, however, be smaller as Plymouth is on the sea, so part of the cache area covers the sea -- which has no tourist attractions.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 68).
Point furthest from centre of cache (35300000, 13300000) is 3341510, 1329660; distance away is: 18.9003 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
99% | 6 | 0 |
98% | 25 | 3 |
96% | 164 | 20 |
93% | 483 | 183 |
Default matching algorithm used was: (version pjb May 29 2002) .
Commentary on this starting point: the cache is centred on non-prime tourist area, but there are prime areas nearby.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 95).
Point furthest from centre of cache (25300000, 07300000) is 2309563, 0594948; distance away is: 25.9494 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
50% | 0 | 0 |
20% | 1 | 1 |
10% | 13 | 2 |
5% | 340 | 139 |
Non-default algorithm used was: Context Matcher with N-squared location-matching algorithm: version of Sept 4.1 2002.
Commentary on this starting point: the cache is centred on a wild moorland area.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 112).
Point furthest from centre of cache (29500000, 09500000) is 2955010, 1129010; distance away is: 17.907 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
50% | 1 | 0 |
20% | 10 | 0 |
10% | 42 | 4 |
5% | 534 | 192 |
Non-default algorithm used was: Context Matcher with N-squared location-matching algorithm: version of Sept 4.1 2002.
Commentary on this starting point: this may be a favourable case as the cache is centred on an area with lots of attractions, and the area outside the cache has fewer attractions.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 71).
Point furthest from centre of cache (25000000, 05600000) is 2733038, 0700543; distance away is: 27.1825 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
50% | 0 | 0 |
20% | 6 | 0 |
10% | 41 | 0 |
5% | 426 | 130 |
Non-default algorithm used was: Context Matcher with N-squared location-matching algorithm: version of Sept 4.1 2002.
Commentary on this starting point: this may be a favourable case as the cache is centred on an area with lots of attractions, and the area outside the cache has fewer attractions; the cache may, however, be smaller as Plymouth is on the sea, so part of the cache area covers the sea -- which has no tourist attractions.
Cache is built using a square with sides 20 km. User's path has steps of (2000, 2000) metres, with a total of 11 retrievals. Amount of look-ahead (both E and N) is 3km. Cache size is 39. (Size is set as a fixed value, derived from original size of 105).
Point furthest from centre of cache (35300000, 13300000) is 3396450, 1186420; distance away is: 19.6703 kilometres
Threshold | Total benchmark retrievals | Retrievals lost |
---|---|---|
50% | 0 | 0 |
20% | 1 | 0 |
10% | 13 | 2 |
5% | 561 | 251 |
Non-default algorithm used was: Context Matcher with N-squared location-matching algorithm: version of Sept 4.1 2002.
Commentary on this starting point: the cache is centred on non-prime tourist area, but there are prime areas nearby.
Default algorithm: total number of tables is 4. Total size of the 4 caches is 156. Total number of retrievals (over the 11 retrieval points) from the cache is 2311. Total number of retrievals lost is 617.
N-squared algorithm: total number of tables is 4. Total size of the 4 caches is 156. Total number of retrievals (over the 11 retrieval points) from the cache is 1989. Total number of retrievals lost is 721.
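The two pairs of totals (2311 and 617; 1989 and 721) can be reproduced by summing the per-starting-point tables above. The sketch below copies the table values; the variable and function names are ours.

```python
# (benchmark retrievals, retrievals lost) per threshold, copied from the tables above
DEFAULT_TABLES = [
    [(5, 2), (16, 4), (51, 16), (258, 93)],
    [(28, 1), (49, 7), (159, 20), (465, 148)],
    [(25, 0), (49, 0), (152, 13), (376, 107)],
    [(6, 0), (25, 3), (164, 20), (483, 183)],
]
N_SQUARED_TABLES = [
    [(0, 0), (1, 1), (13, 2), (340, 139)],
    [(1, 0), (10, 0), (42, 4), (534, 192)],
    [(0, 0), (6, 0), (41, 0), (426, 130)],
    [(0, 0), (1, 0), (13, 2), (561, 251)],
]


def totals(tables):
    """Sum benchmark and lost retrievals over all tables and thresholds."""
    benchmark = sum(b for table in tables for b, _ in table)
    lost = sum(l for table in tables for _, l in table)
    return benchmark, lost


default_totals = totals(DEFAULT_TABLES)      # (2311, 617)
n_squared_totals = totals(N_SQUARED_TABLES)  # (1989, 721)
```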
Algorithm | Look-ahead | Benchmark retrievals | Retrievals lost |
---|---|---|---|
default | 0 | 2311 | 799 |
default | 3 | 2311 | 617 |
default | 5 | 2311 | 517 |
default | 10 | 2311 | 415 |
default | 15 | 2311 | 627 |
default | 20 | 2311 | 1298 |
research | 0 | 1989 | 832 |
research | 3 | 1989 | 721 |
research | 5 | 1989 | 648 |
research | 10 | 1989 | 586 |
research | 15 | 1989 | 686 |
research | 20 | 1989 | 1095 |
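To compare look-ahead settings it helps to express the losses as a proportion of the benchmark retrievals. The sketch below uses the rows of the table above; the function names are ours.

```python
# (algorithm, look-ahead, benchmark retrievals, retrievals lost), from the table above
ROWS = [
    ("default", 0, 2311, 799),
    ("default", 3, 2311, 617),
    ("default", 5, 2311, 517),
    ("default", 10, 2311, 415),
    ("default", 15, 2311, 627),
    ("default", 20, 2311, 1298),
    ("research", 0, 1989, 832),
    ("research", 3, 1989, 721),
    ("research", 5, 1989, 648),
    ("research", 10, 1989, 586),
    ("research", 15, 1989, 686),
    ("research", 20, 1989, 1095),
]


def loss_rates(rows):
    """Map (algorithm, look-ahead) to the fraction of benchmark retrievals lost."""
    return {(alg, la): lost / bench for alg, la, bench, lost in rows}


def best_look_ahead(rows, algorithm):
    """Return the look-ahead with the lowest loss rate for one algorithm."""
    candidates = [(lost / bench, la)
                  for alg, la, bench, lost in rows if alg == algorithm]
    return min(candidates)[1]


rates = loss_rates(ROWS)
best_default = best_look_ahead(ROWS, "default")
best_research = best_look_ahead(ROWS, "research")
```

In this table the loss rate is lowest at a look-ahead of 10 for both algorithms (about 18% for the default and about 29% for the research algorithm), and rises again for larger look-aheads.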