Method for computing near neighbors of a query point in a database6148295Abstract A method for determining k nearest-neighbors to a query point in a database in which an ordering is defined for a data set P of a database, the ordering being based on l one-dimensional codes C.sub.1, . . . , C.sub.1. A single relation R is created in which R has the attributes of index-id, point-id and value. An entry (j,i,C.sub..epsilon.j (p.sub.i)) is included in relation R for each data point p.sub.i .EPSILON.P, where index-id equals j, point-id equals i, and value equals C.sub..epsilon.j (p.sub.i). A B-tree index is created based on a combination of the index-id attribute and the value attribute. A query point is received and a relation Q is created for the query point having the attributes of index-id and value. One tuple is generated in the relation Q for each j, j=1, . . . , l, where index-id equals j and value equals C.sub..epsilon.j (q). A distance d is selected. The index-id attribute for the relation R of each data point p.sub.i is compared to the index-id attribute for the relation Q of the query point. A candidate data point p.sub.i is selected when the comparison of the relation R of a data point p.sub.i to the index-id attribute for the relation Q of the query point is less than the distance d. Lower bounds are calculated for each cube of the plurality of cubes that represent a minimum distance between any point in a cube and the query point. Lastly, k candidate data points p.sub.i are selected as k nearest-neighbors to the query point. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE I
______________________________________
I G(I - 1) .DELTA.
s(I) i(I)
A.sub.1 (x.sub.2, x.sub.1)
______________________________________
00 00 1 (x.sub.1, x.sub.2)
01 00 00 2 (x.sub.2, x.sub.1)
10 01 -01 00 2 (x.sub.2, x.sub.1)
11 11 11 1 (1 - x.sub.1, 1 - x.sub.2)
______________________________________
Affine transformations for the case n=3 are summarized in Table II:
TABLE II
______________________________________
I G(I - 1)
.DELTA. s(I) i(I) A.sub.1 (x.sub.3, x.sub.2,
______________________________________
x.sub.1)
000 000 1 (x.sub.1, x.sub.2, x.sub.3)
001 000 000 2 (x.sub.2, x.sub.3, x.sub.1)
010 001 -001 000 2 (x.sub.2, x.sub.3, x.sub.1)
011 011 011 3 (x.sub.3, 1 - x.sub.2, 1 - x.sub.1)
100 010 +001 011 3 (x.sub.3, 1 - x.sub.2, 1 - x.sub.1)
101 110 110 2 (1 - x.sub.2, 1 - x.sub.3, x.sub.1)
110 111 -001 110 2 (1 - x.sub.2, 1 - x.sub.3, x.sub.1)
111 101 101 1 (1 - x.sub.1, x.sub.2, 1
______________________________________
- x.sub.3)
Affine transformations for the case n=4 are summarized in Table III:
TABLE III
______________________________________
I G(I - 1)
.DELTA. s(I) i(I)
A.sub.1 (x.sub.4, x.sub.3, x.sub.2,
x.sub.1)
______________________________________
0000 0000 1 (x.sub.4, x.sub.3, x.sub.2, x.sub.4)
0001 0000 0000 2 (x.sub.2, x.sub.3, x.sub.4, x.sub.1)
0010 0001 -0001 0000 2 (x.sub.2, x.sub.3, x.sub.4, x.sub.1)
0011 0011 0011 3 (x.sub.3, x.sub.4, 1 - x.sub.2, 1 -
x.sub.1)
0100 0010 +0001 0011 3 (x.sub.3, x.sub.4, 1 - x.sub.2, 1 -
x.sub.1)
0101 0110 0110 2 (1 - x.sub.2, 1 - x.sub.3, x.sub.4,
x.sub.1)
0110 0111 -0001 0110 2 (1 - x.sub.2, 1 - x.sub.3, x.sub.4,
x.sub.1)
0111 0101 0101 4 (x.sub.4, 1 - x.sub.3, x.sub.2, 1 -
x.sub.1)
1000 0100 +0001 0101 4 (x.sub.4, 1 - x.sub.3, x.sub.2, 1 -
x.sub.1)
1001 1100 1100 2 (x.sub.2, 1 - x.sub.3, 1 - x.sub.4,
x.sub.1)
1010 1101 -0001 1100 2 (x.sub.2, 1 - x.sub.3, 1 - x.sub.4,
x.sub.1)
1011 1111 1111 3 (1 - x.sub.2, 1 - x.sub.3, 1 - x.sub.4, 1
- x.sub.1)
1100 1110 +0001 1111 3 (1 - x.sub.2, 1 - x.sub.3, 1 - x.sub.4, 1
- x.sub.1)
1101 1010 1010 2 (1 - x.sub.2, x.sub.3, 1 - x.sub.4,
x.sub.1)
1110 1011 -0001 1010 2 (1 - x.sub.2, x.sub.3, 1 - x.sub.4,
x.sub.1)
1111 1001 1001 1 (1 - x.sub.1, x.sub.3, x.sub.2, 1 -
x.sub.4)
______________________________________
The following exemplary pseudocode illustrates an Algorithm C that is used for calculating the code C(v) of a given point v=(v.sub.n, . . . , v.sub.1), where each v.sub.j is an m-bit number, v.sub.j =.SIGMA..sub.i=1.sup.m v.sub.ij 2.sup.-i (v.sub.ij .epsilon.{0,1}, j=1, . . . , n, i=1, . . . , m). Algorithm C produces the representation C(v)=.SIGMA..sub.i=1.sup.m I.sub.i, 2.sup.-in (0.ltoreq.I.sub.i <2.sup.n (i=1, . . . , m), which also produces the representation v=.SIGMA..sub.i=1.sup.m 2.sup.-i v.sup.i, where v.sup.i =(v.sub.i1, v.sub.i2, . . . , v.sub.in) (i=1, . . . , m). All the operations on vectors in Algorithm C are component-wise. Algorithm C; input: v=(v.sub.n, . . . , v.sub.1); output: C=C(v)=.SIGMA..sub.i=1.sup.m I.sub.i, 2.sup.-in 1. i:=1; s:=0; P=I 2. while v.noteq.0 do 2.1 Convert v into a bit vector and a remainder: (a) v.sup.i :=.left brkt-bot.2v.right brkt-bot. (b) v:=2.backslash.v-v.sup.i 2.2 Compute} I.sub.i : I.sub.i :=J(P.cndot.(s.thrfore.v.sup.i)) 2.3 Compute the new transformation: (a) s:=s.cndot.(P.cndot.s(I.sub.i)) (b) P:=P.cndot.P(I.sub.i) 2.4 Increment} i: i:=i+1. The mapping C is adequate for unambiguous coding of n-dimensional data, that is, the mapping C is one-to-one. As proof, suppose p and q are distinct points in [0,1).sup.n. There exist unique representations p=.SIGMA..sub.i=1.sup..infin. p.sup.i 2.sup.-i and q=.SIGMA..sub.i=1.sup..infin. q.sup.i 2.sup.-i }, where {p.sup.i,q.sup.i }.OR right.{0,1.backslash.}.sup.n (i=1,2, . . . ), and for every natural N and every j (1.ltoreq.j.ltoreq.n), there exist k,1>N such that p.sub.k.sup.i =q.sub.I.sup.i =0. Let i (i.gtoreq.1) be the smallest index such that p.sup.i .noteq.q.sup.i. It follows that in the hierarchy of the sub-cubes of the unit cube that are obtained by repeatedly halving all the dimensions, there exists a sub-cube S.sub.0 of the unit cube having edges of length 2.sup.-i+1, such that {p,q}.OR right.S.sub.0, whereas there exist two disjoint sub-cubes S.sub.1,S.sub.2 of S.sub.0, having edges of length 2.sup.-i, such that p.EPSILON.S.sub.1 and q.EPSILON.S.sub.2. The points of S.sub.1 and S.sub.2 are mapped under J into two disjoint intervals of [0,1) each of length 2.sup.-in. Thus, J(p).noteq.J(q). For convenience, the following notation is introduced. For m=1,2, . . . , 1. For k.sub.1, . . . , k.sub.n such that 0.ltoreq.k.sub.j .ltoreq.2.sup.m -1 (j=1, . . . , n), the cube S(k.sub.1, . . . , k.sub.n,; m) is defined to be equal to {x.epsilon.R.sup.n .vertline.k.sub.j 2.sup.-m .ltoreq.x.sub.j <(k.sub.j +1), 2.sup.-m, j=1, . . . , n). 2. S(m) denotes the collection of all the cubes S(k.sub.1, . . . , k.sub.n ; m). 3. For every S.EPSILON.S(m), the members of S(m+1) that are contained in S are denoted by S(m+1;S). Also, if S.EPSILON.S(m) is not the sub-cube that is mapped under C into [1-2.sup.-mn,1), then the member of S(m) that succeeds S in the ordering induced by C is denoted by S'. For example, if S is mapped into [k2.sup.-mn,(k+1)2.sup.-mn), then S' is mapped into [(k+1)2.sup.-mn,(k+2)2.sup.-mn). Note that for each m, the members of S(m) are mapped under C into 2.sup.mn disjoint (half open) subintervals of [0,1), each of length 2.sup.-mn. For m=1,2, . . . , and for every S.EPSILON.S(m), except for the last S relative to the ordering induced by C, the last member of S(m+1;S) and the first member of S(m+1;S') are adjacent sub-cubes, that is, they have a common facet. As proof, first consider the case of m=1. Suppose S.EPSILON.S(1) is mapped into the interval T(1,k).tbd.[k2.sup.-n,(k+1)2.sup.-n) where 0.ltoreq.k.ltoreq.2.sup.n -2. Thus, the last member of S(2;S) is mapped into the interval T.sub.L .tbd.T(2,2.sup.n (k+1)-1)=[(k+1)2.sup.-n -2.sup.-2n,(k+1)2.sup.-n), while the first member of S(2;S') is mapped into T.sub.R .tbd.T(2,2.sup.n (k+1))=[(k+1)2.sup.-n,(k+1)2.sup.-n +2.sup.-2n). If a point v.sub.R =1/2v.sub.R.sup.1 +1/4v.sub.R.sup.2 +.epsilon..sub.R (where v.sub.R.sup.i .EPSILON.{0,1}.sup.n, i=1,2, and .epsilon..sub.R .EPSILON.[0, 1/4).sup.n) has C(v.sub.R).EPSILON.T.sub.R, then the first two values that Algorithm C calculates for point v.sub.R are I.sub.1 =k+1 and I.sub.2 =0. Thus, necessarily, J(v.sub.R.sup.1)=k+1 which is the same as v.sub.R.sup.1 =G(k+1). Now, the first values of the other objects computed by Algorithm C are S.sub.R.sup.1 =S.sub.R.sup.0 .circle-w/dot.(P.sub.R.sup.0 .cndot.s(k+1)=s(k+1), and P.sub.R.sup.1 =P.sub.R.sup.0 .cndot.P(k+1)=P(k+1). Thus, the other necessary condition for v.sub.R =1/2v.sub.R.sup.1 +1/4v.sub.R.sup.2 +.epsilon. to be mapped into T.sub.R is: 0=I.sub.2 =J(P.sub.R.sup.1 .cndot.(S.sub.R.sup.1 .circle-w/dot.v.sub.R.sup.2))=J(P(k+1).cndot.(s(k+1).circle-w/dot.v.sub.R. sup.2)), which, because G(0)=0, is the same as s(k+1).circle-w/dot.v.sub.R.sup.2 =0, or simply v.sub.R.sup.2 =s(k+1). Similarly, if a point v.sub.L =1/2v.sub.L.sup.1 +1/4v.sub.L.sup.2 +.epsilon..sub.L has C(v.sub.L).EPSILON.T.sub.L, then the first two values that Algorithm C calculates for v.sub.L are I.sub.1 =k and I.sub.2 =2.sup.n -1. Thus, necessarily, J(v.sub.R.sup.1)=k which is the same as v.sub.R.sup.1 =G(k). Because the first values of the other objects computed by Algorithm C are s.sub.L.sup.1 =s(k) and P.sub.L.sup.1 =P(k), the other necessary condition for v.sub.L =1/2v.sub.L.sup.1 +1/4v.sub.L.sup.2 +.epsilon. to be mapped into T.sub.L is: 2.sup.n -1=I.sub.2 =J(P.sub.L.sup.1 .cndot.(S.sub.L.sup.1 .circle-w/dot.v.sub.L.sup.2)=J(P(k).cndot.(s(k).circle-w/dot.v.sub.L.sup.2 )). Because G(2.sup.n -1)=e.sub.n, this is equivalent to v.sub.L.sup.2 =s(k).circle-w/dot.(P(k).cndot.e.sub.n)=s(k).circle-w/dot.e.sub.i(k), where e.sub.j denotes a unit vector with 1 in coordinate j. The fact that v.sub.L.sup.1 =G(k) and v.sub.R.sup.1 =G(k+1) implies that points mapped into T.sub.L and T.sub.R must respectively belong to adjacent members of S(1), but in order to prove that they belong to adjacent members of S(2), v.sub.L.sup.2 and v.sub.R.sup.2 must be relied on. In fact, it suffices to prove that v.sub.L.sup.2 and v.sub.R.sup.2 always differ in one coordinate. For odd k, s(k+1)=G(k-1), while for even k, s(k+1)=G(k). Thus, for odd k, s(k+1)=s(k). When k is odd, then v.sub.R.sup.2 =s(k+1)=s(k) and v.sub.L.sup.2 =s(k).circle-w/dot.e.sub.i(k), so v.sub.R.sup.2 and v.sub.L.sup.2 differ in precisely one coordinate, namely, i(k). Further, when k is even, v.sub.R.sup.2 =s(k+1)=G(k) and v.sub.L.sup.2 =s(k).circle-w/dot.e.sub.i(k), so: (a) if k=0, then V.sub.R.sup.2 =0 and v.sub.L.sup.2 =e.sub.1, so v.sub.R.sup.2 and v.sub.L.sup.2 differ in precisely one coordinate; (b) if k.tbd.0 (mod 4) and k>0, then s(k)=G(k-1)+e.sub.1, so v.sub.L.sup.2 =(G(k-1)+e.sub.1).circle-w/dot.e.sub.i(k) =G(k-2).circle-w/dot.e.sub.i(k), whereas v.sub.R.sup.2 =G(k). But G(k) and G(k-2) differ only in coordinates 1 and i(k-1). Also, i(k)=i(k-1). Thus, v.sub.R.sup.2 and v.sub.L.sup.2 differ in only one coordinate, namely, coordinate 1; and (c) if k.tbd.2 (mod 4) and k>0, then s(k)=G(k-1)-.backslash.e.sub.1, so v.sub.L.sup.2 =(G(k-1)-e.sub.1).circle-w/dot.e.sub.2 =G(k-2).circle-w/dot.e.sub.i(k). Thus, as in the previous case, v.sub.R.sup.2 and v.sub.L.sup.2 differ in only one coordinate, i(k). Finally, given S.EPSILON.S(m), where m>1, R denotes the member of S(1) that contains S. Thus, for every two points p,q in [0,1).sup.n, .parallel.p-q.parallel..sub..infin. .ltoreq.2.vertline.C(p)-C(q).vertline..sup.1/n. As proof, consider p,q.EPSILON.[0,1).sup.n, and denote by m the number such that 2.sup.-mn .ltoreq..vertline.C(p)-C(q).vertline.<2.sup.-(m-1)n. Without loss of generality, suppose C(q)<C(p). Because the 2.sup.-mn .ltoreq.C(p)-C(q), and because each member of S(m) is mapped by C into a half open interval of length 2.sup.-mn, points p and q cannot belong to the same member of S(m). On the other hand, because C(p)-C(q)<2.sup.-(m-1)n, there exists a k, 1.ltoreq.k.ltoreq.2.sup.(m-1)n -1, such that k-1)2.sup.-(m-1)n .ltoreq.C(q)<C(p)<(k+1)2.sup.-(m-1)n. If either C(p)<k2.sup.-(m-1)n or C(q).gtoreq.k2.sup.-(m-1)n, then both p and q belong to the same member of S(m) and, hence, .parallel.p-q.parallel..infin.<2.sup.-m =(2.sup.-mn).sup.1/n .ltoreq..vertline.C(p)-C(q).vertline..sup.1/n. Otherwise, C(q)<k2.sup.-(m-1)n .ltoreq.C(p). In this case, p and q belong to distinct members of S(m) that are mapped under C into consecutive intervals [(k-1)2.sup.-mn,k2.sup.-mn) and [k2.sup.-mn, (k+1)2.sup.-mn). These members of S(m) must be adjacent. Thus , .parallel.p-q.parallel..sub..infin. <2.cndot.2.sup.-m .ltoreq.2.vertline.C(p)-C(q).vertline..sup.1/n. According to the invention, the position of the query point in each order is determined by using a generalized Gray code calculation. The calculation of bits is stopped as soon as the position of the query point in the order relative to the database points has been determined. For example, consider a search based on a single one-dimensional ordering or encoding. Suppose C:[0,1).sup.n .fwdarw.[0,1) is a mapping having the following characteristics: (i) C is one-to-one, and (ii) the inverse mapping C.sup.-1 :I.fwdarw.[0,1).sup.n is continuous, where I.OR right.[0,1) is the image of C. Because C is one-to-one into an interval, C defines an order on [0,1).sup.n : .A-inverted.x,y.EPSILON.[0,1).sup.n .times.<.sub.c yC(x).ltoreq.C(y) Let P={p.sub.1, . . . , p.sub.N }.OR right.[0,1).sup.n. The linear ordering can be implemented as a simple relation R with only two attributes: an point-id and a value. Each p.sub.i has one entry in R where point-id equals i and value equals C(p.sub.i). A query consists of two components: a point q.EPSILON.[0,1).sup.n and a number k.EPSILON.N. An exact answer to the query consists a set O.EPSILON.P of size k so that .A-inverted.x.EPSILON.O.A-inverted.y.EPSILON.(P.backslash.O).parallel.x-q.p arallel..ltoreq..parallel.y-q.parallel.. The relation R is used for obtaining a set of candidates from P. For every .delta.>0 define R(.delta.;q) .OR right.P as R(.delta.;q)={p.EPSILON.P.vertline..vertline.C(p)-C(q.vertline..ltoreq..del ta.} Because C.sup.-1 is continuous, the points in R(.delta.) are likely to be close in the space [0,1).sup.n. A good implementation for one such index in a conventional DBMS is to form a B-tree index in attribute value. The definition of C lends itself to a multitude of other possible one-dimensional orderings or encodings because the dimensions 1, . . . , n can be arbitrarily permutated and each permutation defines a different ordering. Sub-cubes that are distant relative to one permutation may be close relative to another one. The sub-cubes hierarchy itself, however, is the same for all the permutations. In order to obtain further orderings that do not share the same hierarchy, a new ordering is defined that is based on C by using a translation vector .epsilon..EPSILON.[0,1/3}.sup.n. A mapping C.sub..epsilon. :[0,1).sup.n .fwdarw.[0,1) is defined by C.sub..epsilon. (x)=C(3/4(x+.epsilon.)). For each .epsilon., a candidate set is defined by R(.delta.,.epsilon.;q)={p.EPSILON.P.vertline..vertline.C.sub..epsilon. (p)-C.sub..epsilon. (q.vertline..ltoreq..delta.}. A fixed number of orderings are used, using a positive parameter .delta. as follows: 1. Select l vectors .epsilon..sub.1, . . . , .epsilon..sub.l. 2. For each ordering, define a relation R.sub.i similar to R, based on C.sub..epsilon.i, instead of C. 3. For a query (q,k), return a candidate set ##EQU2## 4. For each point in R(.delta.;q), calculate its distance from q and finally return the k closest points. The sole FIGURE shows a process 10 for determining k nearest-neighbors to a query point in a database according to the present invention. Consider a query according to the present invention in a conventional DBMS. At step 11, an ordering is defined for a data set P of a database. The ordering is based on e one-dimensional codes C.sub.1, . . . , C.sub.l. Preferably, the step of defining an ordering for a data set P of a database forms a plurality of cubes. At step 12, single relation R is created in which R has the attributes of index-id, point-id and value. At step 13, an entry (j,i,C.sub..epsilon.j (p.sub.i)) is included in relation R for each data point p.sub.i .EPSILON.P, where index-id equals j, point-id equals i, and value equals C.sub..epsilon.j (p.sub.i). A B-tree index is created at step 14 that is based on a combination of the index-id attribute and the value attribute. A query point is received at step 15 and at step 16 a relation Q is created for the query point having the attributes of index-id and value. At step 17, one tuple is generated in the relation Q for each j, j=1, . . . l, where index-id equals j and value equals C.sub..epsilon.j (q). A distance d is selected at step 18. At step 19, the index-id attribute for the relation R of each data point p.sub.i is compared to the index-id attribute for the relation Q of the query point. A candidate data point p.sub.i is selected at step 20 when the comparison of the relation R of a data point p.sub.i to the index-id attribute for the relation Q of the query point is less than the distance d. Lower bounds can be calculated for each cube that represent a minimum distance between any point in a cube and the query point. Thus, the step of comparing, step 19, is terminated when no lower bound is less than a distance between the query point and any of the candidate data points. At step 19, k candidate data points p.sub.i are selected as k nearest-neighbors to the query point. The following exemplary pseudocode in SQL illustrates a query according to the present invention. Of course, other query languages can also be used. SELECT DISTINCT point-id FROM R,Q WHERE R.index-id=Q.index-id AND R.value<=Q.value+d AND R.value>=Q.value-d The SQL query returns R(d;q). If the results are satisfying, the k nearest neighbors can be found from the candidate set. Otherwise, a larger candidate set can be obtained by increasing d. The new candidate set will be a superset of R(d;q). A set of candidate near neighbors is extracted from each B-tree by simply selecting the identification of the points that are within, for example, 30 positions from the query point in the tree. The set of candidate identifications contains many duplicates, so the duplicate points are eliminated. The number of points from each B-tree remaining after eliminating duplicate points may then need to be increased so the number of candidates is at least k. As previously indicated, the set of candidates that is obtained as the union of the sets of candidates provided by the various orderings is not guaranteed to contain all the k nearest neighbors. In many cases an extremely good approximation is achieved, either in terms of the overlap with the set of the true k nearest neighbors, or in terms of the actual distances between the query point and the reported neighbors when compared to the distances between the query point and the set of the true k nearest neighbors. The present invention also provides a technique for finding the true k nearest neighbors. This extended feature is used when it is essential to get the exact result. If an exact output is desired, the search must be extended and a stopping criteria be defined so that when the algorithm stops it is guaranteed to provide the exact result. The method works as follows. Candidates are obtained by enlarging the scanned intervals within each of the orderings. At the same time, lower bounds are calculated for various cubes, giving the minimum distance between any point in the cube and the query point. The algorithm stops the search as soon as the available bounds imply that no unchecked database point is closer to the query than any of the current k nearest neighbors. The sub-cubes are stored in the same trees that store the linear orders. If q=(q.sub.1, . . . , q.sub.n) is a query point and B=B(a,b)={x.EPSILON.R.sup.n .vertline.a.sub.i .ltoreq.x.sub.i .ltoreq.b.sub.i, i=1, . . . , n} (a,b .EPSILON.R.sup.n) is a rectangular box, then the Euclidean distance between q and B can be computed as follows. The nearest point of the cube is obtained by minimizing .SIGMA..sub.i=1.sup.n (q.sub.i -x.sub.i).sup.2 subject to a.sub.i .ltoreq.x.sub.i .ltoreq.b.sub.i (i=1, . . . , n). The objective function is separable, so the optimal solution can be described as follows: (i) If q.sub.i .ltoreq.a.sub.i, then let d.sub.i =a.sub.i -q.sub.i ; (ii) If a.sub.i .ltoreq.q.sub.i .ltoreq.b.sub.i, then let d.sub.i =0; and (iii) If qi.gtoreq.b.sub.i, then let d.sub.i =q.sub.i -b.sub.i. The distance between q and B is given by .delta.(q,B)=.parallel.d.parallel., where d=(d.sub.1, . . . , d.sub.n). Suppose for a given q there are k points in the database that are nearest to q that are desired to be obtained. Suppose further the database points p.sup.1, . . . , p.sup.k have already been located as candidates for the k nearest neighbors, such that .parallel.q-p.sup.1 .parallel.<. . . <.parallel.q-p.sup.k .parallel.. .parallel.q-p.sup.k .parallel. is an upper bound on the distance to the kth nearest neighbor. Thus, if some box B contains database points, and .delta.(q,B)>.parallel.q-p.sup.k .parallel., then there is no need to search box B because none of its points can be among the k nearest neighbors. Such bounds become useful in case an exact output is desired. Each of the linear orderings are maintained in a tree where nodes store information about bounding boxes, and the ordering lends itself to creating good boxes. Let v.sup.1, . . . , v.sup.N be all the database points. Consider one ordering induced by a code C, so, without loss of generality, assume that C(v.sup.1)<. . . <C(v.sup.N). For each i (i=1, . . . , N), use the representation ##EQU3## Thus, the components v.sup.i1, . . . , v.sup.N1 determine the first level of the ordering. Recall that S(1) denotes the collection of the sub-cubes at the first level of the hierarchy, so it has 2.sup.n members, each with a volume of 2.sup.-n. Because N is expected to be much smaller than 2.sup.n, most of the members of S(1) will not contain any database point. The members of S(1) that do contain such points can be stored at the leaves of a balanced tree according to the order that is induced on them by the codes J(vi1), that is, the inverse of the Gray code. Due to the nature of this code, if two sub-cubes are close in the ordering, then there exists a relatively small box that contains their union. For example, any two consecutive members form a box of volume 2.sup.-n+1, and any two members that are at distance 2 in the ordering are contained in a box of volume 2.sup.-n+2. In general, if the two children of a node in the tree hold the boxes B(a.sup.1,b.sup.1) and B(a.sup.2,b.sup.2), then the parent holds the box B(a,b) where a.sub.i =min(a.sub.i.sup.1,a.sub.i.sup.2) and b.sub.i =max(b.sub.i.sup.1,b.sub.i.sup.2). Thus, relying on the currently known upper bounds on the distance to the kth nearest neighbor, the present invention can determine not to proceed into the subtree rooted at the node, if the box associated with it is sufficiently distant from the query point. During the search for the exact k nearest neighbors, the present invention maintains the following information. First, there is a global upper bound on the distance between the query point and the kth nearest neighbor. Second, in each tree, there is the current interval in the corresponding ordering all of whose points have been checked. Moreover, in each such tree some of the nodes are marked as ones whose subtrees should not be checked. The intervals are expanded repeatedly, and the upper bound is updated. At the same time, lower bounds are updated for a collection of nodes having subtrees that cover the unchecked data points. In each tree, the number of such subtrees at any time does not exceed O(log n) where n is the number of data points. The search may stop when in one tree all the lower bounds of subtrees that cover the unchecked points are greater than the global upper bound. While the present invention has been described in connection with the illustrated embodiments, it will be appreciated and understood that modifications may be made without departing from the true spirit and scope of the invention.
|
Same subclass Same class Consider this |
||||||||||
