Computer method and apparatus for clustering documents and automatic generation of cluster keywords5857179Abstract A computer method and apparatus determines keywords of documents. An initial document by term matrix is formed, each document being represented by a respective M dimensional vector, where M represents the number of terms or words in a predetermined domain of documents. The dimensionality of the initial matrix is reduced to form resultant vectors of the documents. The resultant vectors are then clustered such that correlated documents are grouped into respective clusters. For each cluster, the terms having greatest impact on the documents in that cluster are identified. The identified terms represent key words of each document in that cluster. Further, the identified terms form a cluster summary indicative of the documents in that cluster. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
______________________________________
Step 0 Remove all tokens with numerical characters.
Step 1 Remove all tokens with three consecutive
characters as the same letter.
Step 2 Normalize all upper and lower case letters.
Step 3 Remove all keywords having special
characters other than "-" and ".sub.-- ".
Step 4 For all keywords having either "-" or ".sub.-- ",
remove "-" or ".sub.-- " from the string and
concatenate. (e.g., back-up=>backup).
Step 5 Run through a stop word list of 539 entries.
(e.g. "a", "the",
Step 6 Ignore the words in a line if the line is at
the end of a work unit or article and the
line has less than 3 words. (e.g. signatures
and/or closing lines)
Step 7 Remove all words occurring in less than P
documents.
______________________________________
For example, in an experiment conducted by the inventors, a bag-of-words model was used to determine a list of words for an experimental domain. The model was operated on a UNIX.sub.-- 95 hotline database maintained by Digital Equipment Corporation.TM. which is presently indexed by approximately 300,000 words. This database contains documents generated by customers and hotline employees to describe problems and explain how they were resolved. Because the experimental domain consists of entirely technical articles, on-line dictionaries were of little use, since many technical terms and acronyms are not represented in general dictionaries. Moreover, since the UNIX.sub.-- 95 database covers a limited class of UNIX-related problems, the distinction between problem topics or clusters is less well-defined. This made the experimental task more challenging than working with a corpus dealing with a wide range of topics. The pre-processing steps listed above reduced the 300,000 words in the domain to 11,707. Given a document, for each word in the domain, step 21 (see FIG. 2) of process 37 calculates a weight of that word as used in the document. The weight defines one of the M vector values for the document's vector. Vector values for domain words that do not appear in the document are set to zero. Weightings (or vector values) of the domain words appearing in the document are defined by popular weighting methods including the following two methods described in Singhal, et al., "Length Normalization in Degraded Text Collections", TR95-1507, Dept. of Computer Science, Cornell University, incorporated herein by reference and provided by way of example: a) "Tf-idf" This is one of the most popular weighting methods used in document clustering and classification. The "tf" or "term-frequency" is a simple count of the occurrences of each term in a specified document. ##EQU1## Where W.sub.ij is the weight of term "i" in article "j". tf.sub.ij is the frequency of occurrence of term "i" in article "j". df.sub.i is the total number of articles in which term "i" occurs. n.sub.articles is the total number of articles in the domain. b) "Tf-idf with length normalization" This weighting method is a variant of the tf-idf method which accounts for the length of the article by using a normalization factor. The formula for the computation of the weight W.sub.ij is as follows: ##EQU2## Where W.sub.ij is the weight of term "i" in article "j". tf.sub.ij is the frequency of occurrence of term "i" in article "j". df.sub.i is the total number of articles in which term "i" occurs. n.sub.articles is the total number of articles in the domain. Collectively, the vectors form an initial document by term N.times.M matrix 50 (see FIG. 3) where N is the number of documents in the domain and M is the number of terms in the domain. The initial matrix 50 is preferably formed at step 23 and held in working memory. Subroutine 37 next reduces the dimensionality of the document by term N.times.M matrix as shown in step 25 (see FIG. 2). This is preferably accomplished using the technique of Principal Component Analysis (PCA) 33. PCA involves computation of the Eigenvectors of the sum squared of products matrix, the correlation matrix, or the covariance matrix of the initial matrix and projection of the initial matrix onto the Eigenvectors. Singular Valve Decomposition (SVD) is one of several methods for computing the Eigenvectors. SVD is preferred because it is the most computationally efficient of the known methods. At the Singular Value Decomposition step (SVD) 31, subroutine 37 decomposes the initial document by term N.times.M matrix 50 to obtain a left matrix 52 of N.times.M dimensionality (see FIG. 3) representing the Eigenvectors of AA.sup.T where A is the initial Document by Term matrix (or Term by Document matrix as explained below) and A.sup.T is the transpose of the initial matrix; a center (or singular) matrix 54 of M.times.M dimensionality consisting of the singular values of the initial matrix 50; and a right matrix 56 of M.times.M dimensionality representing the Eigenvectors of A.sup.T A. In a preferred embodiment of the present invention, the right matrix 56 comprises the Eigenvectors of the sum squared of products matrix, however, depending on the application, the Eigenvectors of the covariance or correlation matrix may be used. Those skilled in the art will readily be able to determine which Eigenvectors to use. In the preferred embodiment, SVDPACKC.TM. available as public domain software is employed to carry out the SVD. Note that the SVD process requires that the number of rows in the initial matrix be greater than the number of columns. In some domains, albeit a minority, the number of terms retained after preprocessing may be greater than the number of documents. In these cases, a term by document matrix of dimensionality M by N is used for the initial matrix and SVD is performed on the term by document matrix, instead of the document by term matrix as described above. This results in a left matrix of dimensions M by N and center and right matrices of dimension N by N. Returning now to the preferred embodiment where the initial matrix 50 is a document by term matrix, the right matrix 56 of M by M dimensionality is reduced in dimensionality to a working matrix of loading vectors 58 of M by K dimensionality at step 25 (see FIG. 3). This involves reducing the number of columns in the right matrix to K columns as shown in step 33 (see FIG. 2). The larger the value of K, the more literal the connection is between documents. The smaller the value of K, the more conceptual the relationship is between documents. For determining the value of K, the following rule of thumb is suggested: 1) 100<K<200 for domains having up to 100,000 documents; 2) K.gtoreq.200 for domains having greater than 100,000 documents. This step provides a working matrix of loading vectors 58 of M by K dimensions. Next, at step 34 (see FIG. 2), the initial matrix 50 (see FIG. 3) is projected onto the working matrix of loading vectors 58. This last step of the PCA 25 technique allows for linear mapping of multidimensional data onto lower dimensions with a minimal loss of information. The projection 34 (see FIG. 2) yields independent combinations of the individual variables that represent directions of greatest variance in the original variable space. These combinations are the Eigenvectors of the covariance matrix or the sum squared of products matrix or the covariance matrix of the initial matrix 50. The projection 34 yields a matrix of reduced dimensionality 60 of dimensions N by K (see FIG. 3). For purposes of the present invention, the PCA "variables" comprise individual terms and PCA is employed to capture the underlying correlation in the terms. The reduced dimensionality affords efficient manipulation of the resulting vectors. The N.times.K matrix of reduced dimensionality 60 contains principal component scores F.sub.ij : ##EQU3## where i .epsilon. {l, . . . , n} and indicates a document; j .epsilon. {l, . . . , k} and indicates reduced dimension of document vector; x is the weight of term h in document i; and y is the Eigenvalue at the j.sup.th position in loading vector h. The N.times.K matrix of reduced dimensionality 60 thus represents the documents as vectors in a reduced space of K dimensions reduced from the original M-dimensions. Although dimension and noise are reduced for purposes of manageability, important and indicative words or terms are maintained in the vectors of the matrix of reduced dimensionality 60 for meaningful processing in the following steps. Following PCA 25, clustering is performed on the matrix of reduced dimensionality 60. At step 27 in FIG. 2, a Euclidean distance metric is preferably employed to cluster the vectors of the matrix of reduced dimensionality 60. In a preferred embodiment, a K-means clustering algorithm is employed, such as that described by A. C. Jain and R. C. Dubes, "Algorithms for Clustering Data," Prentice Hall Advanced Reference Series, Prentice Hall, Englewood Cliffs, N.J. 07632, pgs. 96-101, (1988) incorporated herein by reference. As a result, clusters of documents are formed and represented by the vectors of the N.times.K matrix 60. Clustering the N.times.K matrix of reduced dimensionality 60 is generally more robust than clustering in the original high-dimensional document by term N.times.M matrix 50. Other clustering techniques are applicable to the present invention. One of the more popular distance measurements used for document clustering is the cosine similarity measurement. Although the inventors are aware of no formal studies comparing the performances of different similarity measurements, there has been some criticism of the cosine similarity measurements. Hierarchical clustering techniques historically have been a popular method for document clustering. A significant disadvantage of hierarchical clustering algorithms is that they are computationally expensive. Furthermore, usefulness of such algorithms is questionable in a dynamic environment where new documents are continuously added to the collection. For each document in a given cluster, step 29 (see FIG. 2) computes which words had a substantial effect on each cluster. Those words are in common amongst the documents in the given cluster and therefore characterize the cluster. In particular, step 29 examines each of the products being summed in the principal component scores F.sub.ij a document. The largest of such products are identified and the corresponding term or word of the products are considered to be the words having the most impact or greatest effect on the cluster to which the document has been assigned. In the preferred embodiment, the products being summed in the principal component scores of a document are normalized according to any of several known methods. In the preferred embodiment, to accomplish normalization, the ratio between a product in a subject principal component score and the largest product in the subject principal component score is made. The former serves as a numerator and the latter is the denominator. The words corresponding to the ratios having a value greater than 0.5 are determined as having substantial (maximal) impact on a cluster. Those words are then placed in a cluster summary for indicating subject matter/content of the documents in the cluster 30. Depending on the type of search initiated by the user, the summation of individual contributions can be performed according to either of the following techniques: 1) A set of words categorizing the articles in the cluster that are "hit" as a result of the query (browsing and search). 2) A set of words categorizing the entire cluster (browsing). Alternatively, the combination of these two categorized techniques can be used to determine the validity of the cluster for the present query. In the first technique, a user may be interested in only those articles in each cluster which contain the query word. In this case, the user is interested in categorizing the cluster only from the point of view of the documents that contain the particular keyword. It should be noted that pre-computing the contribution of the words for each keyword present in at least one vector in a cluster may be computationally expensive, even though this is a one-time expense. However, the on-line computation of these contributions for only the document vectors that are a result of the query is readily accomplished since this is far less computationally expensive. If the user is browsing, that user may prefer to know the contributions of each word for the entire cluster. This would be true in situations where the user is interested in a smaller number of documents than those which are computed in the first technique, and the user would like to browse those documents which do not necessarily contain search words in the query, but yet are still present in the cluster. In this manner, the user is not limited by the query words chosen, but instead may search beyond the query word. In such a case, it is possible to compute the sum total of the contribution of all document vectors in a particular cluster. An experiment was conducted using the UNIX.sub.-- 95 database from a popular COMET database maintained by Digital Equipment Corp..TM. At the time, the UNIX-95 database consisted of 22,509 documents. For the experiment, the entire set of documents was used. Principal components were computed using a sparse iterative SVD algorithm. For principal component analysis (PCA) it is necessary to define the number of components or "K" to be retained in the working matrix of loading vectors 58. Despite efforts to maintain the database preprocessing to a series of steps which are largely domain independent to facilitate replicability, in this case there was a need for some specific domain-dependent preprocessing. This pertains primarily to certain types of input format which have an associated template. Often, if the length of a document is not sufficiently large, the words in the template dominate the document content and these documents tend to get clustered together on the basis of common templates. In an effort to avoid such meaningless clusters, the template information was masked out. In the final implementation, "K" was selected as 533 principal components from an initial matrix containing 11,707 terms. Although in the initial set of experiments K was set to 533, subsequent experiments with K set following the above-prescribed rule of thumb (100<K<200) produced similar results. The total of 22,509 documents in 533 dimensional projected space were clustered into 400 clusters. Partial results of this clustering are shown in FIGS. 4A and 4B and described below. As explained above, the categorization is accomplished by combining the loading of the different elements in each individual cluster. It therefore follows that words having non-zero loading will be words that appear at least once in that cluster. In FIG. 4A, experimental clusters 0, 140, and 222 are shown with the top 15 words categorizing each cluster. It should be noted that the words beyond the top 15-25 offer little information and therefore, do not have useful categorizing power. As seen in FIG. 4A, each cluster can be distinguished by the words that categorize the cluster. Cluster 0 clearly is directed to "backing up and restoring from dump tapes", cluster 140 is directed to "printing and printer related" problems, and cluster 222 deals with "rebuilding or installing of a system". Note that in cluster 222, the third word is "kernal" which is a misspelling for "kernel". Therefore, the word was commonly misspelled in the database. It can also be noted that certain words such as "sandy" in cluster 140 and "khp" in cluster 222 indicate the names or initials of hot-line specialists who typically handle this class of problems, and therefore, their names occur among the top 15 words. Although these words do have a significance from a data mining standpoint, they do not add much information for characterizing clusters. Depending on the application, filtering methods can be employed to remove the signature tags in a manner similar to the masking of input templates described above. Considering the clusters shown in FIG. 4B, it is clear that all clusters deal with printer related problems. However, by studying the top few words, it is possible to distinguish between clusters 140, 277, and 291. The purpose of such fine-grained clustering is to allow the users to narrow down their search to a very limited number of articles. FIG. 5 provides the results of a specific query "backup" 102 for an on-line categorization of clusters as described above. This query resulted in a hit of 2722 documents 103 in the UNIX-95 database 101. These results were organized as categorized clusters and cluster summaries 104A-104F. Six clusters with the highest number of query "hits" were returned as shown. Each cluster summary 104A-F includes a cluster identification number 112A-F, keywords 105A-F, and document titles 106A-F. The identification numbers 112A-F identify a given cluster in a particular domain. A document count 107A-F is returned along with the identification number 112A-F to indicate the number of documents in the cluster which resulted in query "hits". For example, 64 documents 107A in cluster 0 112A resulted in hits from the query "backup" 102. Keywords identifying the cluster 105A-F are also returned, preferably in order of weighting. For example, in cluster 9 112B, the keywords "backup", "copy", "directory", "create", and "pages" represent keywords for that cluster. Document titles 106A-F are also returned in the cluster summaries 104A-F. The document titles 106A-F allow the user to scan the titles of the documents to determine which documents the user may wish to further investigate. For example, in Group 106, the title "? on NSR backing up Novell servers" indicates to the user that the document is related to general questions regarding backing up on Novell servers, while the title "NSR client won't back up" indicates a document related to a different type of problem. Each document title is transmitted along with a distance indicator 108A-111A. The distance indicator 108A-111A represents the vector distance from the centroid of that particular cluster centroid. For example, a distance of 0.11 108A is closer to the centroid than a distance of 0.20 111A and therefore, the document corresponding to the centroid distance of 0.11 is probably more relevant to the query than the later document. As seen in FIG. 5, Group 0 is directed to "dump command and backing onto tapes" while Group 119 deals with "restoring from the root directory". Groups 70, 106, and 121 all deal with backup across networks. However, each group handles a separate issue, for example, Group 70 is directed to product Networker.TM., while Group 121 deals with DECNSR.TM. related issues. It should be noted that Group 9 consists of documents with no real characterization because this grouping is clustered primarily on template information. Programmers implementing the present invention are free to extract the cluster summary data to fit the needs of the application. The cluster summary data may be presented in a user friendly interactive interface to allow the user to select the format of the returned data or to select a particular number of keywords or documents in each cluster summary. For example, in FIG. 5, the number of keywords is set to 10 and the number of documents is set to 4. The following example describes the working of principal component analysis (PCA) in the context of the present invention. Although the present invention will be commonly employed for application in databases having complete articles, clustering and retrieval of the sentences shown in FIG. 6A is used for purposes of example. Sentences 1-19 are analogous to documents in a database. From the list of 19 sentences shown in FIG. 6A, 12 words were chosen to form the vectors in the bag of words model. This resulted in an initial document by term matrix 64 of dimension 19.times.12 as shown in FIG. 6B. The initial matrix 64 is referred to herein as matrix A and each element of the initial matrix is referred to herein as a.sub.ij. Singular value decomposition using SVDPACKC.TM. was performed on the initial matrix 64 to generate the right matrix 66 shown in FIG. 6C. This matrix 66 represents the Eigenvectors for the right matrix resulting from singular value decomposition. For purposes of the present analysis, a value of 9 was selected for "k" and the initial matrix 64 was projected on the first nine Eigenvectors of the loading vector matrix 66. This reduced loading vector matrix 66 is referred to herein as E and each element of the loading vector matrix is referred to herein as e.sub.ij. The projection of the initial matrix A 64 into the reduced dimensional space of loading vector matrix E 66 is obtained as shown below to form a matrix of reduced dimensionality 68 as shown in FIG. 6D. P=A.times.E.sup.T where: P is the matrix of reduced dimensionality 68 E.sup.T is the transpose of the working matrix of loading vectors 66 A is the initial matrix 64 The resultant projected values of the initial matrix onto the first 9 Eigenvectors of the working matrix of loading vectors is shown in FIG. 6D as a matrix of reduced dimensionality P 68 (19.times.9 dimensionality). This resulting matrix of reduced dimensionality 68 was then clustered into three clusters using a K-means clustering algorithm. The resulting centroids 70 in the elements in each cluster are illustrated in FIG. 6E. Using the centroid shown in FIG. 6E, cluster memberships were defined using a nearest-neighbor rule with each centroid. The categorization of each cluster is accomplished by computing the loading of each element in that cluster across all patterns in that cluster: ##EQU4## where: load.sup.P.sub.j is the load of element j in cluster P; and PRINC is the number of principal components chosen to compute the loads, otherwise referred to herein as "k". For the present example, the value of parameter PRINC was arbitrarily set to 5. In general, for large domains PRINC is determined using the rules of thumb described above for setting K. The results of these computations are shown in FIG. 6F as Clusters I through III. Note that in Cluster I, both "restore" and "dump" occur in the same frequency but "restore" has a higher loading than "dump". Similarly, in the same cluster, note the same phenomenon with "backup" and "questions". Further note that in a preferred application of the present invention, the vectors of the initial matrix A are prepared using either the "Tf-idf" or the "normalized Tf-idf" weightings as described above. This will add some domain knowledge to the weightings, thus making the loading more useful. Given the above, in FIG. 6G, a query "backup" is entered by the user. This query results in "hits" of article numbers 11, 12, 17, and 18, which fall into cluster numbers I and III. Computing the contributions of the words for only these articles provides the results shown in FIG. 6G. In the tables of FIG. 6G, depending on the interest, whether it involves dump/restore problems or NSR backup problems, the user chooses the appropriate cluster for browsing. It should be noted that in this example, words with non-zero contributions are chosen to identify the clusters. As explained above, in the real application, only the top few words will identify each cluster as keywords. In this manner, a domain of documents is represented as a large dimensionality term by document initial matrix. The dimensionality of the initial matrix is reduced to form resultant vectors. Subsequently, the resultant vectors are clustered such that the corresponding documents are grouped into a plurality of clusters. The clusters are defined by keywords which are automatically generated as a result of clustering the resultant vectors. The keywords having highest weight are grouped together to form cluster summaries indicative of the documents in that cluster. This is a significant advantage over prior art techniques which first cluster the documents and require human interpretation of the cluster summaries with regard to categorization and naming of cluster keywords. A given domain may be continually clustered in accordance with the present invention as new documents are added to a domain. Alternatively, the domain may be reclustered periodically in an off-line process. For example, re-clustering may be performed nightly during periods of low domain access. Furthermore, clustering may take place on-line in response to a user query. Other strategies for re-clustering may also be used. While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
|
Same subclass Same class Consider this |
||||||||||
