Automatic recommendation of products using latent semantic indexing of content6615208Abstract Techniques for using latent semantic structure of textual content ascribed to the items to provide automatic recommendations to the user. A user inputs a selected item and, in turn, a latent semantic algorithm is applied to the user selection and the textual content of the items in a database to generate a conceptual similarity between the selection and the items. A set of nearest items to the selected item is provided as a recommendation to the user of other items that may be of particular interest or relevance to the user's original selection based upon the conceptual similarity measure. Claims What is claimed is: Description BACKGROUND OF THE DISCLOSURE
TABLE 1
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opining of computer response time
c3 The EPS user interface management system
c4 Systems and human systems engineering testing of EPS-2
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: widths of tress and well-quasi-ordering
m4 Graph minors: a survey
In this example, a file of text objects is composed of nine technical documents with titles c1-c5 concerned with human/computer interactions, and titles m1-m4 concerned with mathematical graph theory. Using conventional keyword retrieval, if a user requested papers dealing with "human computer interaction", titles c1, c2, and c4 would be returned since these titles contain at least one keyword from the user request. However, c3 and c5, while related to the query, would not be returned since they share no words in common with the request. It is now shown how latent semantic structure analysis treats this request to return titles c3 and c5. Table 2 depicts the "term-by-document" matrix for the nine technical document titles. Each cell, (i,j), is the frequency of occurrence of term i in document j. This basic term-by-document matrix or a mathematical transformation thereof is used as input to the statistical procedure described below.
TABLE 2
DOCUMENTS
TERMS c1 c2 c3 c4 c5 m1 m2 m3 m4
human 1 0 0 1 0 0 0 0 0
interface 1 0 1 0 0 0 0 0 0
computer 1 1 0 0 0 0 0 0 0
user 0 1 1 0 1 0 0 0 0
system 0 1 1 2 0 0 0 0 0
response 0 1 0 0 1 0 0 0 0
time 0 1 0 0 1 0 0 0 0
EPS 0 0 1 1 0 0 0 0 0
survey 0 1 0 0 0 0 0 0 1
tree 0 0 0 0 0 1 1 1 0
graph 0 0 0 0 0 0 1 1 1
minor 0 0 0 0 0 0 0 1 1
For this example, the documents and terms have been carefullly selected to yield a "good" approximation in just two dimensions for expository purposes. FIG. 6 is a two-dimensional graphical representation of the two largest dimensions resulting from the statistical processing via singular value decomposition (SVD). Both document titles and the terms used in them are fit into the same space. Terms are shown as circles and labeled by number. Document titles are represented by squares with the numbers of constituent terms indicated parenthetically. The cosine or dot product between two objects (terms or documents) describes their estimated similarity. In this representation, the two types of documents form two distinct groups: all the mathematical graph theory documents (m1-m4) occupy the same region in space (basically along Dimension 1 of FIG. 6), whereas a quite distinct group is formed for human/machine interaction titles (c1-c5) (essentially along Dimension 2 of FIG. 6). To respond to a user query about "human computer interaction", the query is first folded into this two-dimensional space using those terms that occur in the space (namely, "human" and "computer"). The query vector is located in the direction of the weighted average of these constituent terms, and is denoted by a directional arrow labeled "Q" in FIG. 6. A measure of the closeness or similarity is related to the angle between the query vector and any given term or document vector. One such measure is the cosine between the query vector and a given term or document vector. In FIG. 6, the cosine between the query vector and each of the c1-c5 titles is greater than 0.90; the angle corresponding to the cosine value of 0.90 with the query vector is shown by dashed lines in FIG. 6. With this technique, documents c3 and c5 would be returned as matches to the user query, even though they share no common terms with the query. This is because the latent semantic structure, as captured by the depiction of FIG. 6, fits the overall pattern of usage across documents. Description of Singular Value Decomposition To obtain the data to plot FIG. 6, the "term-by-document" matrix of Table 2 is decomposed using Singular Value Decomposition (SVD). A reduced SVD is employed to approximate the original matrix in terms of a much smaller number of orthogonal dimensions. This reduced SVD is used for retrieval; it describes major associational structures in the matrix but it ignores small variations in word usage. The number of dimensions to adequately represent a particular domain is largely an empirical matter. If the number of dimensions is too large, random noise or variations in word usage will be remodeled. If the number of dimensions is too small, significant semantic distinctions will remain un-captured. For diverse information sources, 100 or more dimensions may be needed. To illustrate the decomposition technique, the "term by-document" matrix, denoted Y, is decomposed into three other matrices, namely, the term matrix (TERM), the document matrix (DOCUMENT), and a diagonal matrix of singular values (DIAGONAL), as follows: Y.sub.t,d =TERM.sub.t,m DIAGONAL.sub.m,m DOCUMENT.sub.m,d, where Y is the original t-by-d matrix, TERM is the t-by-m term matrix with unit-length orthogonal columns, DOCUMENT is the m-by-d document matrix with unit-length orthogonal columns, and DIAGONAL is the m-by-m diagonal matrix of singular values typically ordered by magnitude. The dimensionality of the full solution, denoted m, is the rank of the t-by-d matrix, that is, m.ltoreq.min(t,d). Tables 3, 4, and 5 below show the TERM and DOCUMENT matrices and the diagonal elements of the DIAGONAL matrix, respectively, as found via SVD.
TABLE 3
TERM MATRIX (12 terms by 9 documents)
human 0.22 -0.11 0.29 -0.41 -0.11 -0.34 -0.52 -0.06 -0.41
inter- 0.20 -0.07 0.14 -0.55 0.28 0.50 -0.07 -0.01 -0.11
face
com- 0.24 0.04 -0.16 -0.59 -0.11 -0.25 -0.30 0.06 0.49
puter
user 0.40 0.06 -0.34 0.10 0.33 0.38 0.00 0.00 0.01
system 0.64 -0.17 0.36 0.33 -0.16 -0.21 -0.16 0.03 0.27
re- 0.26 0.11 -0.42 0.07 0.08 -0.17 0.28 -0.02 -0.05
sponse
time 0.26 0.11 -0.42 0.07 0.08 -0.17 0.28 -0.02 -0.05
EPS 0.30 -0.14 0.33 0.19 0.11 0.27 0.03 -0.02 -0.16
survey 0.20 0.27 -0.18 -0.03 -0.54 0.08 -0.47 -0.04 -0.58
tree 0.01 0.49 0.23 0.02 0.59 -0.39 -0.29 0.25 -0.22
graph 0.04 0.62 0.22 0.00 -0.07 0.11 0.16 -0.68 0.23
minor 0.03 0.45 0.14 -0.01 -0.30 0.28 0.34 0.68 0.18
As alluded to earlier, data to plot FIG. 6 was obtained by presuming that two-dimensions are sufficient to capture the major associational structure of the t-by-d Y matrix, that is, m is set to two in the expression for Y.sub.t,d, yielding an approximation of the original matrix. Only the first two columns of the TERM and DOCUMENT matrices are considered with the remaining columns being ignored for illustrative purposes. Thus, the term data point corresponding to "human" in FIG. 6 is plotted with coordinates (0.22,-0.11), which are extracted from the first row and the two left-most columns of the TERM matrix. Similarly, the document data point corresponding to title m1 has coordinates (0.00,0.19), coming from row six and the two left-most columns of the DOCUMENT matrix. General Model Details It is now elucidating to describe in somewhat more detail the mathematical model underlying the latent structure, singular value decomposition technique. Any rectangular matrix Y of t rows and d columns, for example, a t-by-d matrix of terms and documents, can be decomposed into a product of three other matrices: Y=T.sub.o S.sub.o D.sub.o.sup.t, (1) where D.sub.o.sup.t is the transpose of D.sub.o, and such that T.sub.o and D.sub.o have unit-length orthogonal columns, (i.e., T.sub.o.sup.t T.sub.o =I and D.sub.o.sup.t D.sub.o =I) and S.sub.o is diagonal. This is called the Singular Value Decomposition (SVD) of Y. (A procedure for SVD is described in the text "Numerical Recipes--The Art of Scientific Computing", especially Chapter 2, by Press, Flannery, Teukolsky, and Vetterling, 1986, Cambridge University Press, Cambridge, England). T.sub.o and D.sub.o are the matrices of left and right singular vectors and S.sub.o is the diagonal matrix of singular values. By convention, the diagonal elements of S.sub.o are ordered in decreasing magnitude. With SVD, it is possible to devise a simple strategy for an optimal approximation to Y using smaller matrices. The k largest singular values and their associated columns in T.sub.o and D.sub.o may be kept and the remaining entries set to zero. The product of the resulting matrices is a matrix Y.sub.R which is approximately equal to Y, and is of rank k. The new matrix Y.sub.R is the matrix of rank k closest in the least squares sense to Y. Since zeros were introduced into S.sub.o, the representation of S.sub.o can be simplified by deleting the rows and columns having these zeros to obtain a new diagonal matrix S, and then deleting the corresponding columns of T.sub.o and D.sub.o to define new matrices T and D, respectively. The result is a reduced model such that Y.sub.R =TSD.sup.t. (2) The value of k is chosen for each application; it is generally such that k.gtoreq.100 for a collection of 100-3000 text objects (e.g., documents). For discussion purposes, it is useful to interpret the SVD geometrically. The rows of the reduced matrices T and D may be taken as vectors representing terms and documents, respectively, in a k-dimensional space. With appropriate rescaling of the axes, by quantities related to the associated diagonal values of S, dot products between points in the space can be used to access and compare objects. (A simplified approach which did not involve rescaling was used to plot the data of FIG. 6, but this was strictly for expository purposes). These techniques are now described. Fundamental Comparisons There are basically three types of comparisons of interest; (i) those comparing two terms; (ii) those comparing two documents or text objects; and (iii) those comparing a term and a document or text object. As used throughout, the notion of a text object or a data object is general, whereas a document is a specific instance of a text object or a data object. Also, text or data objects are stored in the computer system in files. Two Terms In the data, the dot product between two row vectors of Y.sub.R tells the extent to which two terms have a similar pattern of occurrence across the set of documents. The matrix Y.sub.R Y.sub.R.sup.t is the square symmetric matrix approximation containing all the term-by-term dot products. Using equation (2), Y.sub.R Y.sub.R.sup.t =(TSD.sup.t)(TSD.sup.t).sup.t =TS.sup.2 T.sup.t =(TS)(TS).sup.t. (3) This means that the dot product between the i-th row and the j-th row of Y.sub.R can be obtained by calculating the dot product between the i-th and j-th rows of the TS matrix. That is, considering the rows of TS as vectors representing the terms, dot products between these vectors give the comparison between terms. The relation between taking the rows of T as vectors and those of TS is simple since S is a diagonal matrix; each vector element has been stretched our shrunk by the corresponding element of S. Two Documents In this case, the dot product is between two column vectors of Y. The document-by-document dot product is approximated by: Y.sub.R.sup.t Y.sub.R =(TSD.sup.t).sup.t (TSD.sup.t)=DS.sup.2 D.sup.t =(DS)(DS).sup.t. (4) Thus the rows of the DS matrix are taken as vectors representing the documents, and the comparison is via the dot product between the rows of the DS matrix. Term and Document This comparison is somewhat different. Instead of trying to estimate the dot product between rows or between columns of Y, the fundamental comparison between a term and a document is the value of the individual cell in Y. The approximation of Y is simply equation (2), i.e., Y.sub.R =TSD.sup.t. The i,j cell of Y.sub.R may therefore be obtained by taking the dot product between the i-th row of the matrix TS.sup.1/2. While the "within" (terms or documents) comparisons involved using rows of TS and DS as vectors, the "between" comparisons requires TS.sup.1/2 or DS.sup.1/2 for coordinates. Thus it is not possible to make a single configuration of points in a space that will allow both "between" and "within" comparisons. They will be similar, however, differing only by a stretching or shrinking of the dimensional elements by a factor S.sup.1/2. Representations of Pseudo-Objects The previous results show how it is possible to compute comparisons between the various objects associated with the rows or columns of Y. It is very important in information retrieval applications to compute similar comparison quantities for objects such as queries that do not appear explicitly in Y. For example, it is necessary to be able to take a completely novel query, find a location in k-dimensional latent semantic space for it, and then evaluate its cosine or inner product with respect to terms or objects in the space. Another example would be trying, after-the-fact, to find a representations for documents that did not appear in the original space. The new objects for both these examples are equivalent to objects in the matrix Y in that they may be represented as vectors of terms. For this reason, they are called "pseudo-documents" specifically or "pseudo-objects" generically. In order to compare pseudo-documents to other documents, the starting point is defining a pseudo-document vector, designated Y.sub.q. Then the representation D.sub.q is derived such that D.sub.q can be used just like a row of D in the comparison relationships described in the foregoing sections. One criterion for such a derivation is that the insertion of a real document Y.sub.i should give D.sub.i when the model is ideal (i.e., Y=Y.sub.R). With this constraint, Y.sub.q =TSD.sub.q.sup.t or, since T.sup.t T equals the identity matrix, D.sub.q.sup.t =S.sup.-1 T.sup.t Y.sub.q or, finally, D.sub.q =Y.sub.q.sup.t TS.sup.-1. (5) Thus, with appropriate resealing of the axes, this amounts to placing the pseudo-objects at the vector sum of its corresponding term points. Then D.sub.q may be used like any row of D and, appropriately scaled by S or S.sup.1/2, can be used like a usual document vector for making "within" and "between" comparisons. It is to be noted that if the measure of similarity to be used in comparing the query against all documents is one in which only the angle between the vectors is important (such as the cosine measure), there is no difference for comparison purposes between placing the query at the vector average or the vector sum of the terms. Methodology of An Illustrative Embodiment The foundational principles presented in the foregoing sections are now applied to describe the methodology, with reference to FIG. 7, used to generate the screens displays of FIGS. 1-4--FIG. 7 amplifies on and/or encapsulates certain method steps of FIG. 5 that are particular to the illustrative example of FIGS. 1-4. Processing block 710 depicts that the starting point in the process of FIG. 7 is a catalog of abstracts, with each abstract being representative of a corresponding item (e.g., a full document). Next, processing block 720 is executed to filter the catalog of abstracts to yield a reduced set of abstracts for processing by the latent semantic indexing algorithm--recall that documents are culled so that only "good" recommendations are offered to the purchaser. Then, processing block 730 is invoked apply the latent semantic indexing algorithm to the reduced set of abstracts to produce a vector space representation of the reduced set of abstracts. With reference to the foundational principles of the previous Sections, the "term-by-document" matrix Y is formed from the terms in the reduced set of abstracts (which are now the documents). Then Singular Value Decomposition is applied, and the dimensionality of the space is selected to generate the vector space representation of the reduced set of abstracts, that is, the k largest singular values are selected to yield the approximation matrix Y.sub.R. Once the vector space representation is generated by processing block 730, processing block 740 is used to find so-called "nearest" abstracts for each abstract in the reduced set of abstracts. To accomplish this, the type of comparison utilized is the "Two Documents" comparison already discussed above. Recall in this case, the dot product is between two column vectors of Y. The document-by-document dot product is approximated by: Y.sub.R.sup.t Y.sub.R =(TSD.sup.t).sup.t (TSD.sup.t)=DS.sup.2 D.sup.t =(DS)(DS).sup.t. Thus the rows of the DS matrix are taken as vectors representing the documents, and the comparison is via the dot product between the rows of the DS matrix. To determine the "nearest" abstracts, the cosine measure is used to gauge the closeness of all other abstracts to the given abstract under consideration, that is, one-by-one each abstract is taken as a reference document and the cosine measure of all other abstracts to the given abstract is computed. The "nearest" abstracts are determined based upon pre-determined criteria, such as, the cosine being no less than 0.6 and selection of only the four closest abstracts. Once generated, the "nearest" abstracts are stored in a file for later recall during the actual "search" activity by the purchaser, as evidenced by processing block 750. Recommendations to a purchaser are expedited because the "nearest" abstracts file is generated off-line and stored, that is, the only real-time execution activity required of the on-line system is an access to the file of stored "nearest" abstracts when a purchaser, for example, adds an item to the shopping basket. It is also clear that if a new document is entered into the system and made available to the purchaser, the system is scalable in that the abstract of the new document can be considered as a pseudo-object and the abstracts "nearest" the pseudo-object can serve as recommendations to the purchaser. There is no need to immediately rebuild the stored file for additional documents so that system rebuilds can be scheduled on an as-needed basis. Moreover, in yet another implementation, because of the pseudo-object capability, it may be possible to provide the purchaser with a list of documents "closest" to a set of keywords, and then for each one of the "closest" documents provide a set of recommended "nearest" documents. The final processing step, as exhibited by processing block 760, is that of outputting to the purchaser the recommended list of "nearest" abstracts as an item is added to the shopping basket. Generic Flow Processing (Generation and Storage of "Nearest" Items File) By way of abstracting the teachings and suggestions of FIGS. 5 and 7, flow diagram 800 of FIG. 8 depicts the most general processing in accordance with the method aspect of the present invention when a file of "nearest" items is generated, usually off-line, and then stored. Processing block 810 applies a latent semantic algorithm to the items to determine a conceptual similarity among the items. It is implicit that the items form a catalog in the generic sense, and that each of the items has an associated textual description. Thus the catalog of items is not necessarily composed of documents, but can be composed of, as suggested earlier, audio tape listings, video tape listings, works-of-art, electronic product listings, and so forth; however, each item has an associated written description that can be used with a latent semantic algorithm to find the conceptual similarity among the items (e.g., inner product or dot product with the cosine measure). Next, processing block 820 is invoked to find, for each item, the "nearest" items using the conceptual similarity as a measure of "nearness". The file is stored for later recall during the shopping experience of an on-line purchaser. Finally, processing block 830 is executed so that, whenever each on-line purchaser adds a "latest" item to the shopping cart, the file of "nearest" items determined by processing block 820 is accessed to provide a recommendation of items "nearest" to the item added to the shopping cart. (Of course it is possible to return the "nearest" items to the purchaser at other points in the shopping experience, not just at the time the purchaser selects a "latest" items. For example, the "nearest" items for each item in the shopping basket could be displayed if there is sufficient screen display area to accomplish this display). Generic Flow Processing (Dynamic Generation of "Nearest" Items File) By way of abstracting the teachings and suggestions of FIGS. 5 and 7, flow diagram 900 of FIG. 9 depicts the most general processing in accordance with the method aspect of the present invention when a list of "nearest" items is dynamically generated in response to a purchaser's request--the processing by flow diagram 900 does not require storing a file of conceptual similarity among textual items. Processing block 910 applies, whenever the on-line purchaser adds a "latest" item to the shopping cart, a latent semantic algorithm is applied to the items to determine a conceptual similarity among the items and the "latest" item. It is implicit that the items form a catalog in the generic sense, and that each of the items has an associated textual description. Illustratively, to carry out processing of block 910, the matrix Y.sub.R is computed off-line and stored; when a customer adds the "latest" item, this item is used as a pseudo-object to produce the nearest items based on the conceptual similarity between the "latest" item and all of the items. Processing block 920 is then executed so that a recommendation of items "nearest" to the item added to the shopping cart is generated. (Of course it is possible to return the "nearest" items to the purchaser at other points in the shopping experience, not just at the time the purchaser selects a "latest" items. For example, the "nearest" items for each item in the shopping basket could be displayed if there is sufficient screen display area to accomplish this display). System Hardware of An Illustrative Embodiment With reference to FIG. 10, there is shown high-level hardware diagram 1000 of components that comprise an illustrative embodiment of the system in accordance with the present invention. In particular, the components of system 1000 include: (a) Web server 1010; (b) application server 1020; and (c) storage file 1030. System 1000 is coupled to conventional Internet network or "cloud" 1005. Moreover, access to Internet 1005 by a customer is via PC 1001. In operation, upon log-in and during various stages of the request-response interaction with system 1000, the purchaser is presented with a Web page in HTML format on the display of PC 1001--depicted as Web page 1002 which conveys purchaser input to system 1000, and as Web page 1003 which conveys a system response to the purchaser. When the purchaser requests information from system 1000 such as by typing and/or clicking on links on input Web page 1002, the request for information is transmitted using the "https" protocol to system 1000. In effect, the purchaser requests system responses in the usual manner by typing, pointing and/or clicking on HTML Web pages. Web server 1010 passes the purchaser's input information, such as "search" keywords entered or a Web page link clicked upon by the purchaser, depending upon the stage of the shopping experience, to input web page processor 1011 which parses the Web page to obtain information to pass along to application server 1020. If the purchaser has entered "search" keywords, then application server 1020 consults storage file 1030 to obtain data to return a response Web page. Output Web page processor 1021 receives the response data, and prepares a Web page in HTML format for transmission, via server 1010 and Internet 1005, to PC 1001 as Web page 1003. On the other hand, if the purchaser has clicked upon a item to add to the shopping basket, application server 1020 accesses that part of storage file 1030 that stores the file of "nearest" items to the clicked-upon item. The output of application server 1020 is a set of "nearest" items, which is again placed in HTML format and delivered to PC 1001, via Web server 1010 and Internet 1005, as response Web page 1003. Generalizations to the System For purposes of specificity, but not by way of limitation, system 1000 is illustrated as operating in the Internet environment with only a single server, and initially elucidates the set of services embodied in the product-purchase experience. However, it is equally clear that a general computer network implementation imbued with the structure and characteristics heretofore described can effect the applications in accordance with the present invention. For instance, the product-purchase experience can be implemented locally as well, that is, the client-server may be interconnected, for example, via a local area network (LAN) which is not coupled to the Internet. All of the aforementioned benefits apply to this local system so as to realize a product selection experience. It is possible that additional filtering may be imposed so as to generate the recommendations provided to the purchaser; such filtering may be accomplished, illustratively, by processing blocks 830 or 920. For example, it may be plausible that items below a certain price may not warrant a recommendation, e.g., the price of the item placed in the basket. Also, the "nearest" items list could be filtered based on something known about the purchaser, e.g., no "adult" content for kids, or only content written in languages x and y but not language z. In addition, one might want to take into consideration other preferences known about the user and, for instance, change the ordering of items shown. This post-filer processing complements the pre-filtering processing already discussed. In addition, it is possible to use multiple items as input to the latent semantic algorithm to generate a recommendation. To accomplish this, terms from each of the multiple items may be employed to generate a pseudo-object, that is, the pseudo-object is a composite of the terms in the multiple items. Accordingly, an extension to the illustrative embodiment is that of using all the items in the shopping basket or a subset of items in the shopping basket, in contrast to the latest item, to generate a recommendation. The technique for accommodating multiple items as input is normally implemented in real-time since it would be virtually impossible to generate and store a "nearest" items file using permutations of all items in the catalog to form composite pseudo-objects. Finally, the recommended list could be e-mailed to the purchaser rather than displaying the list immediately on the screen. This may occur when, for example, the recommended list may be too large to be conveniently displayed on the screen display. Also, a recommendation could be e-mailed to the purchaser when a new item is added to the catalog and such added item, if available during the time of the prior interaction with the purchaser, would have been included in the list of recommended items. The e-mail functionality may be implemented by application server 1020. Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.
|
Same subclass Same class Consider this |
||||||||||
