System and method for similarity searching in high-dimensional data space6289354Abstract Information is analyzed in the form of a plurality of data values that represent a plurality of objects. A set of features that characterize each object of the plurality of objects is identified. The plurality of data values are stored in a database. Each data value corresponds to at least one of the plurality of objects based on the set of features. Ones of the plurality of data values stored in the database are partitioned into a plurality of clusters. Each cluster of the plurality of clusters is assigned to one respective node of a plurality of nodes arranged in a tree hierarchy. Ones of the plurality of nodes of the tree hierarchy are traversed. If desired, information may be analyzed for finding peer groups in e-commerce applications. Claims What is claimed: Description FIELD OF THE INVENTION
TABLE 1
Nearest-Neighbor Search with Branch and Bound
Procedure BranchAndBound(Target:t; TreeHierarchy: T)
Begin
Initialize CurrentUpperBound to the minimum of the distances
from Target t to a random sample of data values in the data-space;
Examine the nodes in the tree hierarchy T one by one in a given
order (e.g. depth first search);
for each node n do
begin
if node n is a leaf then
begin
for each data value p indexed by node n
if Distance(p, t) < CurrentUpperBound then
CurrentUpperBound = Distance(p, t);
End
Else
Begin
LB(n) = lower bound of distance from target
t to the Cluster indexed by node n;
UB(n) = upper bound from target t to the
cluster Indexed by node n;
if UB(n) < CurrentUpperBound then
CurrentUpperBound =
UB(n);
if LB(n) > CurrentUpperBound then node n
and all its sub-nodes can be pruned;
end
end
end
Note that with minor adjustments the above methodology may also be applied to performing `furthest-neighbor` searches. ESTIMATING (LOCAL) BOUNDS BY ENCAPSULATION The use of branch and bound methods for nearest-neighbor searching has been explored, for example, in Roussopoulos N., Kelley S., and Vincent F., supra. Conventional methods, such as the use of R-trees (see for example Roussopoulos N., Kelley S., and Vincent F., supra) rely on indexing arrangements which partition the data-space into minimum bounding rectangles oriented along predefined directions. The direction of orientation of the rectangles is typically parallel to the axes. The use of such minimum bounding rectangles allows for relatively quick calculation of the (local) lower and/or upper bounds. It is relatively simple to estimate the nearest and/or farthest distances between such rectangles and a target value. Hence, such bounding rectangles facilitate calculation of an estimate for a (local) upper and/or lower bound of the distance from a cluster of data values to a target value. This approach, however, is problematic in that such minimum bounding rectangles may occupy significantly more volume (in the data-space) than the volume occupied by the enclosed cluster. Thus, the use of bounding rectangles may degrade the "tightness" of the bounds. For example, FIG. 4 shows an estimate of the (local) lower bound from a target value t, obtained using a bounding rectangle to encapsulate cluster C. FIG. 5 shows an estimate of the (local) lower bound from a target value t, obtained using a bounding ellipse to encapsulate cluster C. To improve the calculation of "tight" (local) lower bound estimates, of the distance between a target point and a cluster, it may be desirable to approximate the shape of a cluster by an ellipsoid shape oriented along an arbitrary direction. Further, it may also be desirable to approximate ellipsoidal bounding capsules using an intersection of hyperspheres. FINDING RECTANGULAR CAPSULES FOR CLUSTERS Suppose that cluster C can be written as {x.sub.i.di-elect cons.{character pullout}.sup.n :i=1, . . . , m}, i.e. cluster C includes m data values in an n dimensional data-space. Let the centroid of cluster C be given by ##EQU4## In high dimensional data-spaces it may be the case that data values are sparse and correlated. Correlation between data values may cause the values to align along certain hyperplanes. To find rectangular capsules one may first seek to find at most n hyperplanes associated with cluster C. These n hyperplanes may be found by calculating the eigenvectors of the covariance matrix for the cluster, where the covariance matrix is given by ##EQU5## By the Principal Axes Theorem, a (real) symmetric matrix (of order n) has n real eigenvectors that form an orthonormal basis of {character pullout}.sup.n, say a.sup.1, . . . , a.sup.n. The eigenvectors may either be derived from a constructive proof of the Principal Axes Theorem, or, for example, by use of Singular Value Decomposition (SVD). The n hyperplanes sought may be defined by a.sup.i.multidot.x=b.sup.i (annotated herein [a.sup.i,b.sup.i ]), i=1, . . . ,n, where b.sup.i are given by a.sup.i.multidot.x=b.sup.i,i=1, . . . ,n. (5) Let r.sup.i denote the average (orthogonal) distance from the data values of cluster C to the hyperplane [a.sup.i,b.sup.i ]. For .epsilon.>0, let R.sub.i.sup..epsilon. (C) be the region of the data-space enclosed between [a.sup.i,b.sup.i -r.sup.i.multidot..epsilon.] and [a.sup.i,b.sup.i +r.sup.i.multidot..epsilon.]. A rectangular capsule parameterized by .epsilon. is given by ##EQU6## The parameter .epsilon. controls the size of rectangular capsule R.sup..epsilon. (C). For each data value x.sub.i.di-elect cons.C, let the value .epsilon.(x.sub.i) denote the smallest rectangular capsule such that x.sub.i.di-elect cons.R.sup..epsilon.(x.sup..sub.i .sup.) (C). Let ##EQU7## the average value of .epsilon.(x.sub.i) for all point in C, and let ##EQU8## the standard deviation of the values of .epsilon.(x.sub.i) for all points in c. Let ##EQU9## A rectangular bounding capsule for cluster C, R.sup.* (C), may be chosen as the smaller of R.sup..epsilon..sup..sup.* (C) and R .epsilon.+3.multidot..sigma.(C). R.sup.* (C) may contain most of the data values in cluster C. The data values of cluster C which are not contained in R.sup.* (C) may be considered "outliers", and may be handled separately. Since rectangular capsule R.sup.* (C) is aligned along hyperplanes so as to minimize average distances to data values of cluster C, the volume of the capsule may be relatively small. As explained above, minimizing the volume of the capsule, while capturing most of the data values in the associated cluster, helps to calculate "tight" (local) upper and lower bound estimates. "Tighter" upper and lower bound estimates may result in more efficient pruning, in the course of performing the branch and bound method. Each cluster in a partition may be associated with a coordinate transformation T(C) (rotations and translation). T(C) transforms the coordinate system such that the rectangular capsule C is aligned with the axes in the positive quadrant, with one corner at (0, 0, 0 . . . 0), in the new coordinate system. Such a transformation may be uniquely determined by: the centroid of the given cluster, the set of orthogonal hyperplanes ([a.sup.i,b.sup.i ], i=1, . . . , n), and the volume of rectangular capsule R.sup.* (C). Note that the memory space needed to store the transformation T(C) is proportional to the number of dimensions n, as is the computation time to transform any data value to the new coordinate system. FINDING ELLIPSOIDAL CAPSULES FOR CLUSTERS Rectangular capsules have the disadvantage that when the Euclidean metric is used to evaluate distances, the rectangular capsule may not well approximate the shape of a cluster. Hence, it may be desirable to find spherical type shapes to encapsulate clusters. Ellipsoidal capsules may provide better (local) lower bound estimates on distances to clusters. As in the foregoing, hyperplanes may be determined to assist in aligning the principal axes of the ellipsoids. The axes of an ellipsoid may be set to be the normals a.sup.l, . . . , a.sup.n of the hyperplanes generated as discussed above. The center of an ellipsoid encapsulating cluster C may be set to coincide with the centroid x. For simplicity the equation of ellipsoids is written herein in the coordinate system defined by the orthogonal hyperplanes. As in the above, let r.sup.i denote the average (orthogonal) distance from the data values of cluster C to the hyperplane [a.sup.i, b.sup.i ]. An ellipsoidal capsule, written E.sup..epsilon. (C), for cluster C, parameterized by the variable .epsilon., is given by: ##EQU10## As in the foregoing, parameter .epsilon. controls the size of ellipsoidal capsule E.sup..epsilon. (C). For each data value x.sub.i.di-elect cons.C, let the value .epsilon.(x.sub.i) denote the smallest ellipsoidal capsule such that x.sub.i.di-elect cons.E.sup..epsilon.(x.sup..sub.i .sup.) (C) . Let .epsilon., .sigma., and .epsilon..sup.* be defined as in the foregoing. A ellipsoidal bounding capsule, E.sup.* (C), for cluster C may be chosen as the smaller of E.sup..epsilon..sup..sup.* (C) and R.sup..epsilon.+3.multidot..sigma. (C). E.sup.* (C) may contain most of the data values in cluster C. The data values of cluster C which are not contained in E.sup.* (C) may be considered "outliers", and may be handled separately. CALCULATING (LOCAL) UPPER AND LOWER BOUNDS--DISTANCE TO A CAPSULE While searching the tree hierarchy, it may be desirable to calculate estimates for the upper and lower bounds, respectively, of the distance from a target value t to a particular cluster C. For each cluster, assigned to a node of the tree hierarchy, a transformation T(C) is applied target value t. T(C) transforms the target t to a new coordinate system in which the capsule is aligned with the axes. Given a capsule the maximum or minimum distances to the capsule from the target value t may be determined by solving a constrained optimization problem. For a metric (that is invariant under the transformation T(C)), such as, for example, the Euclidean metric, the problem(after coordinate transformation) may be defined as follows: ##EQU11## where parameter .epsilon. determines the volume of the capsule, and where z=(z.sub.i,z.sub.2, . . . , z.sub.n).sup.T. The notation {character pullout} designates the metric. In the case of the Euclidean metric, for example, ##EQU12## where t=(t.sub.i,t.sub.2, . . . ,t.sub.n).sup.T. The solution to case (A) (rectangular capsules) is given in table 1 below.
TABLE 2
Value of t.sub.i Min value for z.sub.i Max value for z.sub.i
t.sub.i < 0 z.sub.i = 0 z.sub.i = 2 .multidot. r.sub.i
.multidot. .epsilon.
0 .ltoreq. t.sub.i .ltoreq. 2 .multidot. r.sub.i .multidot. .epsilon.
z.sub.i = t.sub.i z.sub.i = max{t.sub.i, 2 .multidot. r.sub.i .multidot.
.epsilon. - t.sub.i }
t.sub.i > 2 .multidot. r.sub.i .multidot. .epsilon. z.sub.i = 2 .multidot.
r.sub.i .multidot. .epsilon. z.sub.i = 0
The solution to the ellipsoidal case, case (B), the following optimality criterion may be obtained by the application of the Kuhn-Tucker conditions. These yield, ##EQU13## where .lambda. is free. To solve for .lambda., and hence the distance, the right hand side of equation (9) may be substituted into condition (B) above. The resulting equation may then be solved for .lambda.. Such an equation, however, may be difficult to solve for .lambda. in closed form. BOUNDING CAPSULES AS INTERSECTIONS OF HYPERSPHERES It may be desirable to bound clusters with capsules that are formed by an intersection of hyperspheres. These bounding capsules may be considered a variation of ellipsoidal capsules. First the orthogonal hyperplanes [a.sup.i,b.sup.i ] (i=1, . . . , n) are calculated as above. Without loss of generality, one may assume the average distance r.sub.i from the data values of cluster C to the i.sup.th hyperplane are sorted. In other words, one may assume that r.sub.1.ltoreq.r.sub.2.ltoreq.. . . .ltoreq.r.sub.n. The set {r.sub.1,r.sub.2, . . . , r.sub.n } into k.ltoreq.n mutually disjoint subsets, {r.sub.1, . . . , r.sub.j.sub..sub.1 }, {r.sub.j.sub..sub.1 .sub.+1, . . . , r.sub.j.sub..sub.2 }, . . . , {r.sub.j.sub..sub.k-1 .sub.+1, . . . , r.sub.n }. Further, j.sub.0 =1.ltoreq.j.sub.1.ltoreq.. . . .ltoreq.j.sub.k-1.ltoreq.n=j.sub.k may be chosen such that the minimum to maximum ratio of values of r.sub.i in any subset is no larger than 2, i.e. ##EQU14## For example, consider a 10-dimensional space in which the set {r.sub.1,r.sub.2, . . . , r.sub.n }={1, 2, 3, 5, 8, 12, 16, 22, 26, 40}. In this case, {r.sub.1, r.sub.2, . . . , r.sub.n } may be divided into the following four subsets: {1, 2 }, {3, 5}, {8, 12, 16}, and {22, 26, 40}. For each of the subsets it is possible to introduce spherical constraints, rather than the elliptical type constraint introduced in equation (7). Thus, cluster C may be approximated by a bounding capsule having a volume defined by following constraints: ##EQU15## where .epsilon..sub.1, .epsilon..sub.2, . . . , .epsilon..sub.k may be determined as discussed in relation to rectangular and ellipsoidal capsules above, and the constraints are in the proper coordinate system. As discussed above it may be desirable to form bounding capsules with as small a volume as possible, while enclosing as many data values of the associated clusters as possible. Thus, in general, it is desirable to form equations (11) with .epsilon..sub.1, .epsilon..sub.2, . . . , .epsilon..sub.k as small as possible, while enclosing as many data values of the associated cluster as possible. Let t=(t.sub.i, t.sub.2, . . . , t.sub.n).sup.T denote a target value (possibly after coordinate transformation). The nearest or farthest distance to the surface of the convex shape may be calculated by solving the following optimization problem: ##EQU16## This optimization problem may be decomposed into k different sub-problems. Each sub-problem involves optimization (maximization or minimization) of a distance from a point to a spherical surface. The sub-problems may be solved in closed form, and hence implemented efficiently with a branch and bound method. In 3 dimensions the shapes created by the above division of the set {r.sub.1, r.sub.2, r.sub.3 } may be spherical (ball like), cylindrical, disc like, or rectangular (box like) depending on the r.sub.i 's. Suppose, r.sub.1 <r.sub.2 <r.sub.3. FIG. 6(a) illustrates the case where ##EQU17## and the set {r.sub.1, r.sub.2, r.sub.3 } is not divided. In this case, the shape generated is ball like. FIG. 6(b) illustrates the case where ##EQU18## and the set {r.sub.1,r.sub.2,r.sub.3 } is divided into the subsets {r.sub.1,r.sub.2 } and {r.sub.3 }. In this, case the shape generated is cylindrical. FIG. 6(c) illustrates the case where ##EQU19## and the set {r.sub.1,r.sub.2,r.sub.3 } is divided into the subsets {r.sub.1 } and {r.sub.2,r.sub.3 }. In this case, the shape generated is disc like. FIG. 6(d) illustrates the case where ##EQU20## and the set {r.sub.1,r.sub.2,r.sub.3 } is divided into the subsets {r.sub.1 }, {r.sub.2 }, and {r.sub.3 }. In this case, the shape generated is rectangular. For example, when {r.sub.1,r.sub.2,r.sub.3 }={3,9,12}, the set may be divided into {3} and {9,12}. Therefore the bounding capsule for the cluster is a disc like shape defined by z.sub.1.sup.2.ltoreq..epsilon..sub.1.sup.2 z.sub.2.sup.2 +z.sub.3.sup.2.ltoreq..epsilon..sub.2 .sup.2. CALCULATION OF RADII FOR HYPERSPHERES The radii of the hyperspheres for cluster encapsulation (i.e. .epsilon..sub.i 's in equation (11) above) may be calculated by a method similar to that detailed in the foregoing discussion of rectangular and ellipsoidal capsules. In the case of rectangular and ellipsoidal clusters, the scalar .epsilon.+3.multidot..sigma. was used as an upper bound to control the size of capsule R.sup.* (C) or E.sup.* (C). This bound may not be appropriate for spherical capsules, as it may lead to capsules with a volume that is significantly greater than the approximated cluster. The scalar .epsilon.+3.multidot..sigma. is an appropriate bound for the size of a capsule if one assumes that the distances between the centroid and the data values in the cluster are normally distributed. For hypersphere like clusters, however, many of the data values may be concentrated near the outer shell or surface of the hypersphere. Thus, for hyperspheres, the assumption of normally distributed distances from the centroid may not be accurate. For example, in a 10-dimensional data space, the ratio of the volume of a sphere with radius 0.9 r and r, is given by 0.9.sup.10.apprxeq.0.348. Hence, approximately 65% of the volume of the sphere with radius r is in the outer shell of the sphere, and is associated with 10% of the radius. Accordingly, assuming that many of the data values of a spherical cluster reside near the outer shell may be a proper assumption. Suppose that data values of a cluster are uniformly distributed about the centroid. It is possible to derive the following result. Let S be a sphere of radius R.sub.0 in an n-dimensional data space containing N uniformly distributed data values. The mean .mu. and the standard deviation .sigma. of the distance of the data values from the center of the sphere are given by ##EQU21## The above result shows that in high-dimensional spaces .mu.+.sigma. is a good approximation of the radius R.sub.0 of the sphere S. Therefore, assuming that points in a cluster are uniformly distributed, the above proposition shows that .mu.+.sigma.may be a good approximation for the radius of an encapsulation sphere. FIG. 7 is a flow chart diagram which is useful for explaining a method of analyzing information in a database in accordance with an exemplary embodiment of the present invention. Steps 710 through 740 of FIG. 7 are identical to steps 110 through 140 of FIG. 1. In step 750 each cluster is bounded by an approximating capsule, as detailed in the foregoing. In step 760 the tree hierarchy is traversed to detect p nearest-neighbors to q target value(s). In other words, a branch and bound method may be applied to the tree hierarchy to detect p data values in the database that are nearest, in terms of a given distance measure, to q target value(s). Distances from target value(s) to clusters (assigned to nodes in the tree hierarchy) are estimated by calculating distances to the surface of the bounding capsules found in step 760. The target value(s), the number of nearest-neighbors sought, and the distance measure may be specified by a user. Note that when multiple target values are specified and/or multiple nearest-neighbors are sought, then the distance from the target(s) to the nearest-neighbor(s) may be defined, for example, as the average distance. Thus, the distance between q target value(s) {t.sup.1, . . . , t.sup.q }, and p nearest-neighbor(s) {y.sup.l, . . . , y.sup.p } (or nearest-neighbor candidates) may be given by ##EQU22## Moreover, the calculation of distances may be performed with respect to a subspace defined by a user specified subset of interesting features F. FIG. 8 is a flow chart diagram of an exemplary recipe for hierarchically partitioning a set of data values into clusters. In step 810 a list is created containing a root cluster. The root cluster may be, for example, a cluster containing all the data values in the database. In step 820 the first cluster is chosen from the list and assigned to a node in a tree hierarchy. The first cluster may itself be partitioned, in turn, into another set of clusters (sub-clusters) in step 830. In step 840 the first cluster is removed from the list. Additionally in step 840, each sub-cluster generated in step 830, which includes less than a predetermined number of data values (say s), is added to the list. In step 850 a decision is made as to whether the list has emptied. If the list has emptied the procedure terminates in step 860. If the list still contains cluster then step 820 through 850 are repeated. FIG. 9 is a flow chart diagram illustrating an exemplary recipe for performing a branch and bound method to detect nearest-neighbor data value(s) to target value(s). In step 910 an initial estimate is calculated for the (global) upper bound, say UB. That is, UB is the estimated (global) upper bound on the distance between the p nearest neighbor(s) to the q target value(s). UB may be calculated by taking a random sample of data values in the database and calculating the distance to the target value(s). As noted above, the distance is calculated based on a distance measure, and a subset of interesting attributes provided by a user. In step 920 the nodes of the tree hierarchy are traversed and a current node is chosen according to the order of descent into the tree. In step 930 a decision is made as to whether the current node is a leaf of the tree hierarchy. If the current node is a leaf step 940 is proceeded with, whereas if the current node is not a leaf step 950 is proceeded with. In step 940 the distance from each data value, included in the cluster assigned to the current node, to the target value(s) is computed. If the least of these distances, say L, is less than UB then UB is reset to L. In step 950 an estimate of the (local) lower bound on the distance between the target value(s) and the bounding capsule, for the cluster assigned to the current node, is calculated. In step 970 a decision is made as to whether the estimate of the (local) lower bound on the distance from the target value(s) to the bounding capsule is greater than UB. If the (local) lower bound is greater than UB then the current node (and all of its sub-nodes) may be pruned from the tree hierarchy in step 980. Otherwise, a decision is made in step 990 as to whether the entire tree hierarchy has been traversed. If the entire tree has been traversed then the procedure terminates in step 995. Otherwise, steps 920 through 990 are repeated. The recipe given by the flow chart diagram of FIG. 9, represents a method that may be performed in response to a user querying a database in search of P nearest neighbor data value(s) to q target value(s). Once p nearest neighbor(s) are detected in the database, a user may choose to narrow the search further. In other words, a user may choose to "zoom in" to detect p'<p nearest neighbors to the same target(s). In this case, the procedure given by the flow chart diagram of FIG. 9 may be repeated except that UB is initially set (step 910) to the distance between the, previously detected, p nearest neighbors, and only the portion of the tree hierarchy yet unpruned (step 920) need be traversed. In other words, the branch and bound method executed to detect p' nearest neighbors may begin with the state of the tree hierarchy at the end of the branch and bound method executed to detect p nearest neighbors. A user querying a database may also choose to widen the search for p">p nearest neighbors. In other words, a user may choose to "zoom out" to detect p">p nearest neighbors to the same target(s). In this case, the procedure given by the flow chart diagram of FIG. 9 may be repeated except that the portion of the tree hierarchy yet unpruned may be traversed first to detect p" nearest neighbors to the target(s). UB may then be set to the distance between the p" nearest neighbors detected and the target(s). Second, the entire tree hierarchy may be traversed again, using the branch and bound method, with UB as an initial (global) upper bound. FIG. 10 is a flow chart diagram illustrating an exemplary recipe for calculating the distance between the target value(s) and a surface of a bounding capsule. In step 1010 the target value(s) are transformed to a new coordinate system. As discussed in the foregoing, the calculation of equation(s) representing the surface of bounding capsules may be performed in a new coordinate system in which the capsule is aligned with the coordinates. Thus, for ease of computation of distances, it may be advantageous to transform the target value(s) to the new coordinate system. In step 1020 an estimate of the lower bound on the distance between target value(s) and the surface of the bounding capsule is calculated in the new coordinate system. FIG. 11 is a block diagram of an apparatus for analyzing information over a computer network in accordance with an exemplary embodiment of the present invention. In the embodiment depicted in FIG. 11, multiple client computers 1102 may access a server 1106, for example a Web server, over a network 1104. Server 1106 may have a data memory 1110 as well as a cache 1108. The server may further include a Central Processing Unit (CPU) 1112 for processing information and a disk 1114 for storing data. Data values may be collected from client computers 1102 by server 1106 over network 1104. Clients 1102 may also query server 1106 regarding the information stored in data memory 1110 and disk 1114. In particular, a client computer may supply server 1106 with q target value(s), a distance measure, a subset of interesting features, as well as the number of nearest-neighbors sought p. Server 1106 detects p nearest-neighbors, stored in a database on disk 1114 and in data memory 1110, and sends the analysis results back to the client computer. The results of the analysis and the nearest-neighbors detected may be displayed to a user at the client computer end, for example, either in the form of text or in graphical form. E-COMMERCE EXAMPLE The following definitions are helpful in understanding the e-commerce example: e-commerce: e-commerce denotes Electronic Commerce. Electronic commerce refers to the process of trading goods and commodities by electronic means, for example, purchases made over the Internet. The increasing popularity of the Internet has made e-commerce applications of commercial importance. One aspect e-commerce applications is the availability of transaction histories for customers. Databases may be maintained for different consumers. The information stored in these databases may be analyzed for the purpose of providing product recommendations to customers concerning, for example, items for sale. Recommendations: In an e-commerce application it may be desirable to predict the purchasing characteristics of individual customers. Purchasing behavior predictions may be made by analyzing information stored in a database of transaction records. Computer technology enables databases of information regarding a customer's transaction history to be maintained. It may be desirable, for example for marketing purposes, to predict the types of products which may be preferred by a customer. Information stored in customer transaction databases may be analyzed to detect customers that exhibit "similar" purchasing behavior. The set of customers who are similar to a given target customer is referred to as a "peer group". Based on the purchasing behavior of the "peer group" it may be possible to predict the types of products which may be preferred by a target customer. The predicted (preferred) items may be provided to the target customer as a list of recommended items. Promotion list: A promotion list is a list of items used to restrict the number of products which may be recommended to a target customer. For example, although a hundred thousand items may be offered for sale by a retail organization, it may be desirable to restrict the recommended items to one hundred of the items. This restricted list is referred to as a promotion list. A promotion list may be used, for example, in order to promote the sale of specific products. Peer group: A peer group is a group of L data values (transaction records) "similar" to m given target values. For example, each data value (transaction record) may correspond to a transaction or purchase made by a customer. Data values (transaction records) may be associated with items for sale. For example, a data value may be an n-tuple of quantities of items purchased a particular customer; i.e. n quantities (attributes) each corresponding to a different item for sale or product category. As a peer group is "similar" to m given target values, the purchasing patterns of the peer group may be used to make recommendations. Collaborative filtering applications: collaborative filtering applications are applications in which the behavior of a peer group of target values may be used to provide recommendations. Consider, for example, a database containing information regarding the purchasing behavior of different supermarket customers. For each customer a database transaction record (data value) exists indicating the quantity of each item purchased. Hence, a transaction record may be an n-tuple of real numbers indicating the quantity of the item purchased (e.g. <3,3.96,100,0> corresponding to 3 golden delicious apples, 3.96 pounds of ground beef, and 100 grams of dark chocolate, and no muffins purchased). A target customer may thus be represented by an n-tuple of real numbers. A peer (nearest-neighbor) group of the target customer may be found by identifying n-tuples of real numbers in the database which are "similar", as discussed in the foregoing. L "similar" (or related) n-tuples comprising the peer group may be detected. Based on the purchases indicated by the peer group, recommendations may be made to the target customer. FIG. 12 is a flow chart diagram illustrating an exemplary interactive collaborative filtering application. In step 1210 L data values (or transactions) that are comparable ("similar") to q target values are detected. Both the target values and the integer L may be defined by a user. The L data records may be detected, for example, using the method illustrated in the foregoing and in FIG. 7. In step 1220 a set of items which are associated with a nonzero quantity in some or all of the L detected data values is found. The set of items associated with a nonzero quantity (i.e. purchased by some or all of the peer group) may be, for example, identified as the set of items purchased in a predetermined percentage of the L detected data values. In step 1230 the set of items associated with a nonzero quantity is intersected with a promotion list. The items in the intersection may be, for example, provided to a customer. Note that the collaborative filtering application of FIG. 12 may be made interactive by allowing a user to vary L, for a fixed set of target values, for each repetition of steps 1210, 1220, and 1230. Step 1240 may be added to allow a user to interactively input L (the number of nearest-neighbors sought). For example, suppose that for a target customer each of the nearest neighbors purchased at least one loaf of bread, one container of butter, one jar of jam, and one bottle of milk. A set of items which are associated with a nonzero quantity in all of the L detected data values is {Bread, Butter, Jam, Milk}. Suppose further that a supermarket chooses a promotion list {Bread, Milk, Sausage}. Then the items in the intersection, namely, {Bread, Milk}={Bread, Butter, Jam, Milk}.andgate.{Bread, Milk, Sausage}, may be recommended to the target customer. FIG. 13 is a flow chart diagram illustrating another exemplary collaborative filtering application. Given q user defined target values, in step 1310 L data values (or transactions) that are comparable ("similar") to q user defined target values are detected. The value of L may be initialized by a user. In step 1320 a set of items which are purchased by some or all of the L detected data records is found and is intersected with a promotion list. The total number of items in the intersection (preferred items to be provided to a customer) is stored. In step 1330 the total number of items in the intersection is compared with a given number, say r. If the total number of items is less than r then in step 1360 L is incremented, and steps 1310, 1320, and 1330 are repeated. Otherwise, a decision is made in step 1340 whether the total number of items is greater than r. If the total number of items in the intersection is greater than r then in step 1350 L is decremented, and steps 1310, 1320, and 1330 are repeated. If the number of items in the intersection is r, then these items are provided (recommended) to a customer in step 1370. One of ordinary skill in the art will readily recognize how to modify the method of FIG. 13, so that items may be provided to a customer if the number of items in the intersection is within a tolerance level of r, i.e. r+e.sub.tl. Although illustrated and described herein with reference to certain exemplary embodiments, the present invention is nevertheless not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the spirit of the invention.
|
Same subclass Same class Consider this |
||||||||||
