Method for data and text mining and literature-based discovery
6886010
Abstract
Text searching is achieved by techniques including phrase frequency analysis and phrase-co-occurrence analysis. In many cases, factor matrix analysis is also advantageously applied to select high technical content phrases to be analyzed for possible inclusion within a new query. The described techniques may be used to retrieve data, determine levels of emphasis within a collection of data, determine the desirability of conflating search terms, detect symmetry or asymmetry between two text elements within a collection of documents, generate a taxonomy of documents within a collection, and perform literature-based problem solving. (This abstract is intended only to aid those searching patents, and is not intended to limit the disclosure of claims in any manner.)
Claims
1. A method of retrieving data, relevant to a topic of interest, from at least one collection of documents, comprising the steps of:
Description
BACKGROUND OF THE INVENTION
For clustering in particular, the non-retrieval of critical technical phrases by the phrase extractor will result in artificial cluster fragmentation. Conversely, the retention of non-technical phrases by the phrase extractor will result in the generation of artificial mega-clusters.
SUMMARY OF THE INVENTION
It is an object of at least one embodiment of the present invention to maximize both the number of documents (defined herein as a text record in any format) retrieved and the ratio of relevant to non-relevant documents (signal to noise ratio) during a literature search. It is an object of some embodiments of the present invention to use text and data mining to identify topical matters that have been emphasized in prior research. It is also an object of some embodiments of the present invention to use text and data mining as a tool for innovation.
These and other objects are achieved, in one embodiment, by using a test query to retrieve a representative sample of documents from a database, classifying the retrieved documents as relevant or not relevant, finding text element (phrase) frequencies and text element co-occurrences in at least the relevant documents, grouping the extracted text elements into thematic categories, and then using the thematic grouping and phrase frequency data to develop new queries and query terms. New query terms are tested against the representative sample of documents. If the signal-to-noise ratio of the newly added terms is above a specified limit in the representative sample, the newly added terms are retained in the developing query. The developed query is then applied to the full database.
In another embodiment of the invention, a taxonomy may be developed from a collection of documents. High technical content text elements are extracted from the collection, and used to generate a factor matrix.
The text elements with the largest influence on the themes of each factor (category) are extracted from the factor matrix, and used to generate a co-occurrence matrix of high technical content phrases. The matrix cell values are then normalized (by equivalence index or inclusion index) and text elements are grouped, using clustering techniques, on the normalized matrix. The text element frequencies of occurrence within each group are summed to indicate a level of emphasis for each group. Document clustering techniques can also be used to assign document clusters to the groups defined above, to produce levels of emphasis.
In another embodiment of the invention, the factor matrix process for selecting text elements with the largest influence on the themes of each factor (described in the previous paragraph), or any similar latent semantic analysis approach for selecting important text elements within a thematic category, may be used to identify asymmetries in documented phenomena where none were expected.
In a further embodiment of the invention, text and data mining techniques are applied to assist in developing solutions to a given problem.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows the factor eigenvalue-factor number plot for un-rotated factors on a linear scale.
FIG. 2 is a ten factor plot.
FIG. 3 is a plot of Break Point vs. Number of Factors on a linear scale.
FIG. 4 is a re-plot of FIG. 3 on a log-log scale.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The following definitions will assist in a clear understanding of the present invention:
The present invention includes advances in clustering, advances in information retrieval, and unique applications of these clustering and information retrieval advances. In the present invention, the advances in clustering are an integral component of the advances in information retrieval.
The advances in information retrieval may or may not be an integral component of the advances in clustering, depending on the specific application. Because clustering is foundational to all the unique applications, it will be described first. Then, the information retrieval steps will be described, with the clustering inserted appropriately. Finally, the unique applications will be described, with the clustering and information retrieval inserted appropriately. Clustering is the grouping of objects by similarity. In text mining, there are two main types of objects, text elements (e.g., words, phrases) and documents. Each of these object types can be grouped manually (assignment to groups by visual inspection) or statistically (assignment to groups by computer algorithms). Thus, there are four major clustering categories to be discussed in this invention:
These will be described in order of increasing complexity, relative to how they are used in the process and system covered by the present invention. In the present invention, the advances in the manual clustering techniques are in the unique applications of the techniques. The advances in the statistical clustering techniques are in the improved quality of the text elements, or documents, that are input to the clustering algorithms, as well as in the unique applications.
In manual text element clustering, a technical expert is presented with a list of text elements. The generation of that list will be described under information retrieval. The technical expert, by visual inspection, assigns selected (or all) text elements from the list into categories. These categories could be pre-selected from a standard classification scheme, or could be generated by the technical expert during the assignment process to include all the text elements. The use of these categories of grouped text elements will be described in the section on unique applications.
In manual document clustering, a technical expert is presented with a set of documents. The generation of that set will be described under information retrieval. The remainder of the process is identical to the manual text element clustering.
In statistical text element clustering, a list of text elements is presented to a computer algorithm. The generation of that list will be described under information retrieval. The first step of the algorithm is to generate a factor matrix (or similar latent semantic category generator), whose rows are the text elements, and whose columns are factors. These factors represent the major themes of the database being analyzed. The matrix elements are numerical values called factor loadings. Each matrix element Mij represents the contribution of text element i to the theme of factor j.
If the text elements in a specific factor are arranged in numerical order, one tail of the factor will have high positive value text elements, and the other tail of the factor will have high negative value text elements. Usually, but not always, the absolute value of the text elements in one tail will dominate the absolute value of the text elements in the other tail. The relatively few high factor loading text elements in the predominant tail will determine the theme of the factor. The predominance of a few high factor loading text elements (in the high factor loading tails) in determining the factor themes leads to the second step of the algorithm. The high factor loading text elements that determine the theme of each factor in the factor matrix are extracted and combined. The remaining text elements that do not have high factor loadings in any factor are treated as trivial text elements in the context of the database being analyzed, and are excluded from the text element clustering that follows. Some of these excluded text elements may have the appearance of high technical content text elements. However, in the context of determining factor themes, their contribution is negligible. Thus, one major advance of this factor matrix filtering technique is to select high factor loading text elements for clustering that are context-dependent (see Example 1). Before these filtered text elements are input to the clustering algorithm, some of them are conflated to reduce the dimensionality of the system; i.e., reduce the number of different text elements that the algorithm has to process. Conflation is the process of combining text element (mainly word) variants into a common form. This could include combining singulars and plurals, different tenses, etc. Most, if not all, conflation software available today is context independent. The various stemming algorithms (e.g., Porter's) have fixed rules for conflation, independent of context. 
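By way of illustration only, the factor matrix filtering step described above can be sketched as follows. The sketch assumes the factor matrix has already been computed and is supplied as a mapping from each text element to its row of factor loadings. The function name `filter_by_factor_loading` and the cutoff value of 0.4 are hypothetical, since no specific threshold is given here. As a simplification, the sketch treats a high absolute loading in either tail as sufficient, rather than first identifying the predominant tail of each factor.

```python
def filter_by_factor_loading(loadings, threshold=0.4):
    """Keep text elements with a high absolute loading in at least one factor.

    loadings:  dict mapping text element -> list of factor loadings (one per factor).
    threshold: hypothetical cutoff; the description does not specify a value.
    Returns a dict mapping each retained text element to the indices of the
    factors in which it has a high loading.
    """
    kept = {}
    for element, row in loadings.items():
        # Factors in which this element sits in a high-loading tail
        # (positive or negative).
        strong = [j for j, value in enumerate(row) if abs(value) >= threshold]
        if strong:
            kept[element] = strong
    return kept


# Hypothetical loadings for four text elements on two factors.
loadings = {
    "battery": [0.71, 0.05],
    "electrolyte": [0.62, -0.10],
    "turbulence": [0.03, -0.68],
    "results": [0.08, 0.11],
}
kept = filter_by_factor_loading(loadings)
```

In this example "results" is excluded because it has no high loading in any factor, which corresponds to the treatment of trivial text elements in the context of the database being analyzed.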
The present technique of factor matrix filtering allows variants to be conflated only if they appear on the list of high factor loading extracted text elements. Thus, if the singular variant of a text element is on the high factor loading list, and the plural version has low factor loading, then these text elements are being used in a different context in the specific database being analyzed, and cannot be conflated. Conversely, if both variants are on the high factor loading list, and especially if their numerical values are close, they are being used interchangeably in determining factor themes in the specific database being analyzed, and can be conflated. Thus, a second major advance of this factor matrix filtering technique is to select text elements for conflation that are context dependent (see Example 2). The text elements that have been filtered by the factor matrix (and, typically, conflated) are then input to a text element clustering algorithm. Depending on the application, a multi-link hierarchical aggregation clustering algorithm, or a partitional clustering algorithm, may be used for text element clustering. The multi-link clustering approach provides a hierarchical categorization structure, and is particularly useful when the database being analyzed has a strong central theme, with inter-related component sub-themes. Analyses of single technical disciplines (e.g., aerodynamics, fullerenes, electrochemistry) tend to fall within this category. The partitional clustering approach provides a flat (single level) categorization structure, and is particularly useful when the database being analyzed has multiple disparate themes. Analyses of multi-discipline organizations or national research programs tend to fall within this category. However, a partitional clustering algorithm could provide a hierarchical structure if applied to a single discipline correctly. 
The clusters output by the computer algorithm would have to be combined to form the hierarchical structure. In addition, a hierarchical structure could provide a flat partitional structure. At any given level in the hierarchy, the separate categories could be viewed as a partitional structure. In statistical document clustering, two general approaches can be used. One is the traditional context-independent approach, and the other is the context-dependent approach described in this application. In the traditional approach, a set of documents is presented to a computer algorithm for matching quantification (e.g., assignment of a similarity metric for the pair). The algorithm compares each pair of documents, and assigns a similarity metric to the pair. The similarity metric could be the text elements shared by the documents, or other type of metric. The algorithm then constructs a matrix (of all documents) whose elements are the similarity metrics. The algorithm then aggregates the documents into similar groups. Many document clustering algorithms are readily available commercially or through freeware. The use of these categories of grouped documents will be described in the section on unique applications. In the context-dependent document clustering approach, pre-processing of the text in the documents is performed before the set of documents is presented to the computer algorithm for matching quantification. This pre-processing is the same as that described in the section on statistical text element clustering. Factor matrix filtering is performed on the text in the documents to conflate the text element variants in order to reduce dimensionality, and remove the text elements that do not influence the themes of any factors. This results in documents that consist of high factor loading context-dependent text elements being provided to the computer algorithm for matching quantification. Some applications of the four different clustering approaches will be described. 
These include information retrieval, level of emphasis determination, citation mining, literature-based discovery, and literature-based asymmetry prediction. Information retrieval, in the present context, is the retrieval of one or more documents from a source database that are relevant to the objective of the database search. The database search could be manual (e.g., reading a journal or conference proceedings and extracting relevant papers) or electronic (e.g., providing a set of instructions, called a query, to a database search engine to look for documents with desired characteristics). While the information retrieval advances to be described could conceptually be applied to either the manual or electronic searches, in practice they are mainly applicable to the electronic searches. From another perspective, information retrieval can be differentiated by the boundedness of the source database; i.e., narrowly bounded or broadly bounded. Examples of narrowly bounded would include all the papers published in a specific journal volume or conference proceedings, or all the papers published by German authors in Medline journals in 2001. Examples of broadly bounded would include all papers published in Medline journals on cardiovascular problems, or all papers published in Science Citation Index journals on fluid flow problems. While the clustering techniques described previously could be applied to documents retrieved using either manual or electronic searches, or documents from narrowly bounded or broadly bounded source databases, the information retrieval advances independent of the component clustering advances (or non-advances) will mainly focus on documents retrieved with electronic searches from broadly bounded databases. The information retrieval process of this invention is overviewed (see Example 3). Then, the specific steps and advances are described in detail. 
The retrieval process is focused on developing a query (group of terms that will retrieve comprehensive records from a source database and yield a high ratio of relevant to non-relevant records). This query is then provided to a database search engine, and comprehensive, highly relevant records are retrieved. The advances made in the information retrieval component of the invention occur during the course of query development. The query development process is iterative, and incorporates relevance feedback at each iterative step. In the first step of query development, a collection of documents, such as a database, is selected. A test query is then applied to the collection of documents. The test query may be any search term, or number of terms. Terms in a query, or test query, may be joined by Boolean connectors such as "AND", "OR", or "NOT". Typically, the user will select a test query believed likely to retrieve a collection of text material having a greater ratio of relevant to non-relevant documents than that existing in the original collection of documents. A sample of the documents retrieved with the test query is then chosen using criteria unlikely to bias the selection of search terms. The sample size is selected to be both representative and manageable. Generally, the larger the sampling, the more likely the method of the present invention will produce improved search results when applied to the complete database. Of course, as the sample size increases, the development of sample searches according to the present invention becomes more time-consuming and labor-intensive. With improvements in computer technology, larger sample sizes will become more reasonable. At this point, the sample of retrieved documents is classified according to the documents' relevance to the subject matter of the search. 
The relevance classification may be binary, such as either ‘relevant’ or ‘not relevant’, or may be graded or ranked on a verbal or numerical scale (e.g., 1-5, or ‘strongly relevant’, ‘moderately relevant’, etc). The classification may be performed in a computer-intensive mode, or a manually-intensive mode. The computer-intensive mode is much faster, but is moderately less accurate. In the computer-intensive classification mode, document clustering software is used to group the documents in the retrieved sample by similarity. The document clustering is a three-step process. In the first step, the raw documents are processed by factor matrix filtering to remove trivial text elements and conflate text element variants. This step removes much of the background ‘noise’ from the documents, and minimizes similarity resulting from matching of trivial text elements. This is the pre-processing step. In the second step, all documents in the retrieved sample are compared on a pair-wise basis. A similarity metric is assigned to each pair (e.g., number of words in common divided by total words in both documents). This is the matching step. Then, the documents are grouped, such that the similarity among documents within the group is large, and the similarity of documents between groups is small. This is the clustering step. Commercial software is available to perform document clustering. Document clustering tends to group documents into groups that are at similar levels of relevance. A technical expert then samples documents from each group, and performs a final judgment as to the relevance of each group. In the manually-intensive classification mode, the technical expert reads each document in the retrieved sample, and performs the final relevance judgment. Once the documents have been classified according to relevance, the unique text patterns in each relevance category are identified, and used to modify the query accordingly. 
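The matching step of the document clustering process described above can be sketched as follows, for illustration only. The sketch interprets "number of words in common divided by total words in both documents" as the number of shared distinct words divided by the number of distinct words across the pair; this interpretation, the whitespace tokenization, and the function names are assumptions. The documents are assumed to have already passed through the pre-processing step.

```python
from itertools import combinations


def similarity(doc_a, doc_b):
    """Shared distinct words divided by distinct words across both documents."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def pairwise_similarity(docs):
    """Assign a similarity metric to every pair of documents (by index)."""
    return {
        (i, j): similarity(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }


# Hypothetical pre-processed documents.
docs = [
    "lithium battery anode",
    "lithium battery cathode",
    "turbulent boundary layer",
]
scores = pairwise_similarity(docs)
```

The resulting pair scores would then be passed to whatever clustering step is chosen; that step is not sketched here.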
The approach described is a hybrid of statistical and manual. The first step in text pattern identification is the extraction of text elements from each relevance categorization of documents. TextSlicer™ (TS) from Database Tomography (DT), for example, may be used for performing text element (word or phrase) extraction, although any other verified text element extraction software may be used. The TextSlicer™ software allows for multiple word/phrase counting; i.e., a word can be counted as a stand-alone single word as well as when it is used in multi-word phrases. This feature is especially valuable for generating taxonomies, where the shorter phrases can serve as category headings. (The Natural Language Processing software that we also use for multiple tasks, TechOasis™, does not allow this multiple counting.) There are typically two levels of filtering in TextSlicer™. Stop-words in the algorithm eliminate trivial words such as ‘the’, ‘and’, etc. Regardless of the software used, it is typically best to remove "stop-words" and other trivial phrases. Regardless of the software used, manual cleanup may then be performed to eliminate lower technical content phrases. A frequency analysis is then performed on the extracted text elements. If the documents selected for extraction include more than the most relevant reviewed documents, this analysis can compare the frequency of a particular text element within highly relevant documents to its frequency within less relevant reviewed documents. The frequency analysis generates a list of extracted text elements, including frequency data for each listed text element. The frequency data includes the number of times the text element appears in the reviewed documents in a particular category of relevance to the subject matter of the search. The next step in the text pattern identification is grouping of text elements in thematic categories. The process recommended primarily is statistical text element clustering. 
If the time available is limited, then the first phase of statistical text element clustering, namely, factor matrix generation, can be used as an alternative to the full process. Here, the factors from the factor matrix serve as a proxy for the clusters from the clustering algorithm. The purpose of the groupings in each relevance category is to ensure that the query has representation from each of the major themes of each relevance category. This will ensure a balanced query, and that major themes are not overlooked. For example, if a binary relevance system (relevant/non-relevant) is chosen, and clustering shows that the relevant documents can be thematically divided into four main clusters, then query text elements should be selected from each of the four clusters. Thus, the thematic grouping serves as a guide for query term selection, to be used in conjunction with the following criteria and process for selecting query terms. The use of groupings as guides for the query term selection, and the generation of these groupings by the statistical text element clustering process, represent advances of the present invention.
At this point in the process, a co-occurrence matrix of the highest frequency text elements in each relevance category is generated. Each element Mij of the text element co-occurrence matrix is the number of times that text element i occurs in the same spatial domain as text element j. In practice, the co-occurrence matrix element is usually the number of domains in which text element i co-occurs with text element j. The spatial domain could be a semantically-defined domain (e.g., sentence, paragraph, abstract, etc), or numerically-bounded domain (e.g., every 200 word block in a document). Typically, the matrix cell values of the co-occurrence matrix are normalized, e.g., by equivalence index or inclusion index. Cell values for the matrix may also be normalized by standard statistical techniques, resulting in a normalized correlation matrix.
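For illustration only, the co-occurrence counting and normalization described above can be sketched as follows. The sketch counts the number of domains in which two text elements co-occur, with each domain represented as a set of text elements already extracted from it. The formulas for the equivalence index and the inclusion index are not stated here; the sketch assumes the definitions commonly used in co-word analysis, namely Cij squared divided by (Ci times Cj), and Cij divided by the smaller of Ci and Cj. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(domains, elements):
    """Count the domains containing each element and each pair of elements.

    domains:  list of sets, each holding the text elements found in one domain.
    elements: set of text elements of interest.
    """
    single = Counter()
    pair = Counter()
    for domain in domains:
        present = sorted(elements & domain)
        single.update(present)
        # Pairs are stored in alphabetical order, e.g. ("battery", "lithium").
        pair.update(combinations(present, 2))
    return single, pair


def equivalence_index(c_ij, c_i, c_j):
    """Assumed definition: Cij^2 / (Ci * Cj)."""
    return (c_ij ** 2) / (c_i * c_j) if c_i and c_j else 0.0


def inclusion_index(c_ij, c_i, c_j):
    """Assumed definition: Cij / min(Ci, Cj)."""
    smaller = min(c_i, c_j)
    return c_ij / smaller if smaller else 0.0


# Hypothetical domains (e.g., abstracts) reduced to their text elements.
elements = {"lithium", "battery", "anode"}
domains = [
    {"lithium", "battery"},
    {"lithium", "battery", "anode"},
    {"lithium", "anode"},
    {"battery"},
]
single, pair = cooccurrence_counts(domains, elements)
eq = equivalence_index(
    pair[("battery", "lithium")], single["battery"], single["lithium"]
)
inc = inclusion_index(
    pair[("anode", "lithium")], single["anode"], single["lithium"]
)
```

Under these assumed definitions the equivalence index is symmetric in the two text elements, while the inclusion index reaches 1 whenever the less frequent text element always appears together with the more frequent one.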
The query term selection now proceeds as follows. The extracted text elements are reviewed by a technical expert(s) and divided into three groups:
The co-occurrence of text elements in the frequency-analyzed documents is then analyzed to generate a list of co-occurrence pairs. Each of these co-occurrence pairs includes an anchor text element (selected so that each major thematic category generated by the grouping of text elements is represented by at least one anchor text element) and another extracted text element. This analysis generates a list of co-occurrence pairs including co-occurrence data for each listed co-occurrence pair. The co-occurrence data is combined with the frequency data for the extracted text elements. A subject matter expert or expert system then reviews the frequency data for the extracted text elements and the co-occurrence data. From this analysis, the expert or expert system selects candidate query terms, thus forming a list. The list of candidate query terms should represent each of the thematic candidate terms. The expert or expert system must then define an efficient query from the list of candidate query terms. Criteria to be considered for selecting a query term from the list of candidate query terms that will retrieve more relevant records include, but are not limited to, the following:
Criteria to be considered for selecting a query term from the list of candidate query terms that will eliminate non-relevant records include, but are not limited to, the following:
Generally, a query term tracking system marks (i.e., tags) each text element (term) selected for the query, as well as all text elements that would retrieve a subset of the total number of documents retrieved by the selected term. That is, if one selects the term "lithium battery" as a query text element, the tracking system automatically marks as previously selected the term "secondary lithium battery". This marking system avoids duplication of effort and redundancy, since all documents discovered using the term "secondary lithium battery" would have been already discovered using the term "lithium battery". The tracking system is best handled by a computer, but for very small searches may be done manually. For large searches, where large numbers of candidate query terms exist, this type of tracking system is mandatory for credible term selection feasibility. This tracking system is another advance of the present invention. Where the relevance classification scheme is binary (relevant/not relevant), this comparison may be readily performed by comparing the number of occurrences of a text element within relevant retrieved records to the number of occurrences of that text element within non-relevant retrieved records. Where the relevance classification scheme is other than binary, each class of relevance may be assigned a numerical value (e.g., highly relevant=1, moderately relevant=0.5, not relevant=0). The occurrence of each text element in the record is then multiplied by the numerical value assigned to the relevance of the record, to provide a numerical rating for each text element. To obtain a relevance-weighted frequency rating for a given text element, the frequency ratings for that text element are summed over all records. 
To obtain a non-relevance weighted frequency rating for a given text element, the occurrence of each text element in the record is then multiplied by one minus the numerical value assigned to the relevance of the record (if relevance is graded from zero to one) to provide a numerical rating for each text element. The ratio of the relevance-weighted frequency rating for a given text element to its non-relevance weighted frequency rating can then be used to determine the value of a search term in the same manner as a binary rating system would use the ratio of the number of relevant records containing that text element to the number of non-relevant records containing a text element. This ratio suggests the usefulness of a text element as a search term. For example, a term with a high ratio would be considered for use with the "AND" connector, while a term with a low ratio might be considered for use with a "NOT" connector to eliminate less relevant or non-relevant records. Generally, for each iteration in a search, a figure of merit may be used to determine the efficiency or value of the search at that iteration. Typically, when the slope of that figure of merit approaches zero, the addition of further search terms will yield little or no new relevant records. For example, after each iteration, one may determine the total number of new relevant records retrieved (for systems with more than two relevance ratings, the count for each record is weighted according to its relevance rating, i.e., a record with a relevance rating of 0.5 counts as one-half record). When this total drops sharply, the marginal utility of additional search terms will be sufficiently low that the user may wish to discontinue further searching. The development of this marginal utility capability for selecting efficient queries represents another advance in the present invention (see Example 3). 
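For illustration only, the relevance-weighted and non-relevance weighted frequency ratings, their ratio, and the per-iteration count of new relevant records described above can be sketched as follows. The sketch assumes relevance is graded from zero to one, as in the example values given earlier. The data layout and the function names are hypothetical, and the treatment of a term that never appears in a non-relevant record (an infinite ratio) is a choice made for this sketch.

```python
def weighted_ratings(records, term):
    """Return (relevance-weighted, non-relevance weighted) frequency ratings.

    records: list of (relevance, counts) pairs, where relevance is in [0, 1]
             and counts maps text element -> occurrences in that record.
    """
    relevant = 0.0
    non_relevant = 0.0
    for relevance, counts in records:
        occurrences = counts.get(term, 0)
        relevant += occurrences * relevance
        non_relevant += occurrences * (1.0 - relevance)
    return relevant, non_relevant


def relevance_ratio(records, term):
    """Ratio of the relevance-weighted to the non-relevance weighted rating."""
    relevant, non_relevant = weighted_ratings(records, term)
    if non_relevant == 0:
        # Sketch-level choice: a term seen only in fully relevant records.
        return float("inf") if relevant > 0 else 0.0
    return relevant / non_relevant


def marginal_gains(iterations, relevance):
    """Relevance-weighted count of new records retrieved at each iteration."""
    seen = set()
    gains = []
    for retrieved in iterations:
        new = set(retrieved) - seen
        gains.append(sum(relevance.get(record, 0.0) for record in new))
        seen |= new
    return gains


# Hypothetical sample: one highly relevant, one moderately relevant,
# and one non-relevant record.
records = [
    (1.0, {"lithium battery": 3, "cell": 1}),
    (0.5, {"lithium battery": 2, "cell": 2}),
    (0.0, {"cell": 4}),
]
ratio_battery = relevance_ratio(records, "lithium battery")
ratio_cell = relevance_ratio(records, "cell")

# Hypothetical record identifiers retrieved at three successive iterations.
gains = marginal_gains(
    [{"r1", "r2"}, {"r2", "r3"}, {"r3"}],
    {"r1": 1.0, "r2": 0.5, "r3": 1.0},
)
```

In this example "lithium battery" has a high ratio and "cell" a low one, and the gain falls to zero at the third iteration, the point at which a user applying the marginal utility criterion might stop adding terms.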
When the marginal utility is of this form (i.e., additional relevant records retrieved per additional query term), it is implicitly assumed that either the ratio of additional relevant records to non-relevant records retrieved will be above a threshold floor value, or the ratio of total relevant records to total non-relevant records will be above a threshold floor value. Once the new query is defined, the new query (which may be a modification of the test query or may be an entirely new query) is applied to the same collection of documents (source database) to which the test query was applied. Application of the new query retrieves an additional set of documents from the collection. The present invention also includes a text element method of determining levels of emphasis that is an alternative to using document clustering (and counting documents assigned to various categories) for determining levels of emphasis. Using the methods discussed above, a taxonomy of a collection of documents containing at least one unstructured field is generated, either statistically or manually. Text elements are statistically or manually assigned to each group (category) within the taxonomy. Within each group of the taxonomy, the frequencies of occurrence for the text elements in that group are summed. The summation cannot include the frequency component of text elements nested within other text elements. The figure of merit, the summation of text element frequencies within each group, indicates the relative emphasis placed on each group by the collection of documents. The development of this alternative to document clustering for determining levels of emphasis represents another advance in the present invention (See Example 4). The present invention can also be used for citation mining. For citation mining, the user selects one or more documents before creating the collection to be studied. 
The collection to be studied can then be created, typically using a citation index, so that all documents within the collection either cite or are cited by the selected document or documents. This collection of documents is then subjected to text mining as described above. The development of a process for citation mining represents another advance in the present invention (See Example 5). The present invention also includes a method of literature-based discovery. In a first approach, the user selects a problem and a collection of records believed to be relevant to the problem (problem literature). The problem literature is generally a subset of a larger collection (usually orders of magnitude larger) of records referred to herein as the "source database." Each record within the source database includes at least one unstructured field. Information retrieval and information processing, including text element extraction, text element-frequency analysis, and clustering (statistical and/or non-statistical) are performed. As a result, the text elements are grouped into thematic categories and subcategories. Next a directly related topical literature is generated for each subcategory. The directly related topical literatures should be disjoint (that is, independent of each other (i.e., no overlapping records) and independent of the problem literature (i.e., no overlapping records)). Directly related topical literatures are literatures whose queries are essentially generated from the problem literature. (By "essentially," it is meant that text elements with conceptually similar meanings, such as synonyms, in addition to phrases taken directly from the problem literature, may be used). To generate directly related topical literatures, a query is developed for each subcategory, recognizing that each literature is representative of one of the subcategories of the taxonomy. Many of the text elements for the query can come from the text elements in the taxonomy. 
However, if the text elements retrieve only a narrow representation of a category, then the query should be expanded to include synonyms (or additional synonyms) for the text elements from the taxonomy to provide a more complete representation of the category. The query is inserted into the search engine and retrieves the directly related topical literatures (for each subcategory) from the source literature. Each subcategory directly related topical literature is subjected to a text element frequency analysis, to generate a list of text elements for each directly related topical literature. Text elements on that list also found in the problem literature are removed from each list. The remaining text elements in each list become candidates for discovery since they could not be found in the problem literature. Both the number of lists in which a candidate text element appears and the frequencies with which the candidate text element appears in the lists compared to its appearance in the overall database may be used to rank the priority among the candidates. Use of this type of text element frequency comparison with the source literature for ranking, however, can sometimes overlook candidates that are related to a variety of conditions. Thus, according to the present invention, text element frequency comparison with the source literature will typically be used for ranking candidates less frequently, and with a lower weight, than the number of lists in which a candidate text element appears. Text element co-occurrences can also be reviewed and ranked as potential candidates. Typically, the text elements (and text element co-occurrences) developed from the directly related topical literature fall into three categories: 1. not candidates for discovery (typically overly generic); 2. solution candidates (by inclusion or omission of act or material); and 3. candidate query terms to develop indirectly related intermediate literatures. 
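The candidate-generation and ranking steps described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name, the data layout (one frequency dictionary per directly related topical literature), and the example terms are hypothetical. As a simplifying assumption, the lower-weight comparison against frequencies in the overall database is replaced by a tie-break on summed frequency across the lists.

```python
from collections import Counter

def rank_discovery_candidates(topical_lists, problem_terms):
    """Rank text elements that are absent from the problem literature.

    topical_lists: dict mapping a subcategory name to a dict of
        {text_element: frequency} for its directly related literature.
    problem_terms: set of text elements found in the problem literature.
    """
    list_counts = Counter()  # number of lists containing each candidate
    total_freq = Counter()   # summed frequency across lists (assumed tie-breaker)
    for freqs in topical_lists.values():
        for term, freq in freqs.items():
            if term in problem_terms:
                continue  # already present in the problem literature
            list_counts[term] += 1
            total_freq[term] += freq
    # Primary key: number of lists; secondary key: summed frequency.
    return sorted(list_counts,
                  key=lambda t: (-list_counts[t], -total_freq[t], t))

# Hypothetical example data, for illustration only.
topical = {
    "circulation": {"blood flow": 10, "fish oil": 3, "viscosity": 5},
    "autoimmunity": {"antibody": 7, "fish oil": 2},
}
problem = {"blood flow", "antibody"}
ranked = rank_discovery_candidates(topical, problem)
```

In this sketch a candidate that appears in more lists always outranks one that appears in fewer, reflecting the greater weight the text gives to the number of lists.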
The indirectly related intermediate literatures are then retrieved from the source database by applying the candidate query terms to the source literature. The retrieved records are subjected to text element frequency analysis to generate a list of text elements for each indirectly related topical literature. Text elements found in the problem literature are removed from each list. The remaining text elements in each list become candidates for further discovery since they could not be found in the problem literature. Both the number of lists in which a candidate text element appears and the frequencies with which the candidate text element appears in the lists compared to its appearance in the overall database may be used to rank the priority among the candidates. Use of this type of text element frequency comparison with the source literature for ranking, however, can sometimes overlook candidates that are related to a variety of conditions. Thus, according to the present invention, text element frequency comparison with the source literature will typically be used for ranking candidates less frequently, and with a lower weight, than the number of lists in which a candidate text element appears. Text element co-occurrences can also be reviewed and ranked as potential candidates. (See Example 6.) The above process with the indirectly related literature may be repeated as often as desired to identify text elements in higher order indirectly related literatures, although acceptable results are often obtained without further searches for indirectly related literature. Also, as one drifts further from the directly related literature, the candidate text elements are less likely to have a direct impact on the problem. Another approach is to start with a solution (technology), and then look for a problem (application) upon which the solution may have an impact. 
This approach basically follows the same steps as the problem-based approach, i.e., it is analogous to that approach. The user selects a solution and a collection of records believed to be relevant to the solution (solution literature). The solution literature is generally a subset of a larger collection (usually orders of magnitude larger) of records referred to herein as the "source database." Information retrieval and information processing, including text element extraction, text element-frequency analysis, and clustering (statistical and/or non-statistical), are performed. As a result, the text elements are grouped into thematic categories and subcategories. Next a directly related topical literature is generated for each subcategory. The directly related topical literatures should be disjoint (that is, independent of each other (i.e., no overlapping records) and independent of the solution literature (i.e., no overlapping records)). Directly related topical literatures are literatures whose queries are essentially generated from the solution literature. (By "essentially," it is meant that text elements with conceptually similar meanings, such as synonyms, in addition to text elements taken directly from the solution literature, may be used). To generate directly related topical literatures, a query is developed for each subcategory, recognizing that each literature is representative of one of the subcategories of the taxonomy. Many of the text elements for the query can come from the text elements in the taxonomy. However, if the text elements retrieve only a narrow representation of a category, then the query should be expanded to include synonyms (or additional synonyms) for the text elements from the taxonomy to provide a more complete representation of the category. The query is inserted into the search engine and retrieves the directly related topical literatures (for each subcategory) from the source literature. 
Each subcategory directly related topical literature is subjected to a text element frequency analysis, to generate a list of text elements for each directly related topical literature. Text elements found in the solution literature are removed from each list. The remaining text elements in each list become candidates for (applications) discovery since they could not be found in the solution literature. Both the number of lists in which a candidate text element appears and the frequencies with which the candidate text element appears in the lists compared to its appearance in the overall database may be used to rank the priority among the candidates. Use of this type of text element frequency comparison with the source literature for ranking, however, can sometimes overlook candidates that are related to a variety of conditions. Thus, according to the present invention, text element frequency comparison with the source literature will typically be used for ranking candidates less frequently, and with a lower weight, than the number of lists in which a candidate text element appears. Text element co-occurrences can also be reviewed and ranked as potential candidates. Typically, the text elements (and text element co-occurrences) developed from the directly related topical literature fall into three categories: 1. not candidates for discovery (typically overly generic); 2. application candidates (by inclusion or omission of act or material); and 3. candidate query terms to develop indirectly related intermediate literatures. The indirectly related intermediate literatures are then retrieved from the source database by applying the candidate query terms to the source literature. The retrieved records are subjected to text element frequency analysis to generate a list of text elements for each indirectly related topical literature. Text elements found in the solution literature are removed from each list. 
The remaining text elements in each list become candidates for further (applications) discovery since they could not be found in the solution literature. Both the number of lists in which a candidate text element appears, and the frequencies with which the candidate text element appears in the lists compared to its appearance in the overall database, may be used to rank the priority among the candidates. Use of this type of text element frequency comparison with the source literature for ranking, however, can sometimes overlook candidates that are related to a variety of conditions. Thus, according to the present invention, text element frequency comparison with the source literature will typically be used for ranking candidates less frequently, and with a lower weight, than the number of lists in which a candidate text element appears. Text element co-occurrences can also be reviewed and ranked as potential candidates. The above process with the indirectly related literature may be repeated as often as desired to identify text elements in higher order indirectly related literatures, although acceptable results are often obtained without further searches for indirectly related literature. Also, as one drifts further from the directly related literature, the candidate text elements are less likely to have a direct impact. In yet another approach, the user may research the mechanism that links a solution to the problem to which it applies. In this approach, the user conducts the problem-based literature-based discovery and the solution-based literature-based discovery as described above, resulting in two separate lists of query terms. The two lists are then compared to determine the text elements that they have in common. These shared text elements represent mechanisms that potentially link the problem with the solution. The development of a systematic context-based clustering process for literature-based discovery represents another advance in the present invention. 
(See Example 6). The present invention can also be used for literature-based asymmetry detection (see Example 7), another type of literature-based discovery. The objective is to identify differences in thematic categories where none would be expected, based on literature text element and/or document occurrences alone. For example, in a lung cancer literature, the objective might be to identify differences in patient incidence of right lung cancer vs. left lung cancer, or in patient incidence of upper lobe cancer vs. lower lobe cancer, and so on. The first step in literature-based asymmetry detection is to retrieve a set of documents that is representative of the topical literature of interest. In the lung cancer case, this set of documents (collection) could be all the documents in the Medline database that are lung cancer-related Case Reports (typically individual patient case reports written as journal articles). The next step is identical to that used for the context-dependent text element conflation and trivial text element filtering described previously. The narrative material in the collection is converted to text elements with associated occurrence frequencies. A correlation matrix of these text elements is generated, and then a factor matrix is generated using the correlation matrix. The factor loadings in each factor of the factor matrix are examined. Substantial differences in factor loadings for text elements representing phenomena thought to be symmetrical will identify candidate text elements for further examination. This is especially true in the factor loading region where at least one of the text elements has a sufficiently high factor loading to have a major influence on the factor theme. For example, in the lung cancer example shown in Example 7, suppose the text element "right lung" had a factor loading of 0.4, and the text element "left lung" had a factor loading of 0.2, in a given factor. Then the potential for lateral (left vs. 
right) asymmetry becomes a candidate for further investigation. The next step is to select those records from the collection that focus specifically on the elements of the potential asymmetry. In the lung cancer example, a query would be developed to select those records in the collection that focus specifically on right lung cancer, and those records in the collection that focus specifically on left lung cancer. Once these records have been selected, the ratio of records in each category is computed. This ratio is then used to estimate the degree of asymmetry reflected in the collection. If the collection is representative of the actual occurrence of the phenomena being examined, then the ratio can be used to estimate the degree of asymmetry of the occurrence of the phenomena. In the lung cancer example, if the lung cancer Medline Case Reports are assumed to be representative of actual lung cancer patient incidence, then the ratio can be used to estimate the actual right/left patient lung cancer incidence. The development of a systematic factor matrix filtering process for asymmetry detection represents another advance in the present invention. Having described the invention, the following examples are given to illustrate specific applications of the invention including the best mode now known to perform the invention. These specific examples are not intended to limit the scope of the invention described in this application. EXAMPLES Example 1 Factor Matrix Text Filtering and Clustering This example shows how factor analysis was used as a context-dependent word filter for cluster analysis, and demonstrates how the fractal nature of factor matrix-associated graphs affected the resultant number of factors used in the analysis. In the first part of this example, 930 Medline Abstract-containing records related to Raynaud's Phenomenon, and published in the 1975-1985 time period, were retrieved. 
Non-trivial single words (659) were extracted from the database of Abstracts, along with the number of documents in which each word appeared (document frequency). The co-occurrence of word pairs in the same document (word co-occurrence frequency) was computed, and a correlation matrix (659×659) of word pairs was generated. The variables were factorized, and a factor matrix was generated. The factor matrix was then used to select the sub-set of the 659 words that had the most influence in determining the theme of each factor. This sub-set of context-dependent important words was then input to the clustering algorithm. The core of this factor matrix-based filtering process was the factor matrix itself. Its rows were the input words/phrases, and columns were the number of factors used. A major challenge was selection of the number of factors to be analyzed, as well as grouped into a taxonomy. This example will also show that the fractal nature of the factor matrix selection process had to be taken into account when selecting the number of factors to be used in generating the factor matrix. The example starts with a discussion of factor matrices. Then, the fractal nature of the factor matrix selection process is shown using the Raynaud's Phenomenon database as an example. This is followed by a thematic analysis of two factor matrices. Then, the use of the factor matrix for filtering high technical content words for input to the hierarchical clustering algorithms is presented. The resulting clustering algorithm output is analyzed thematically, and a taxonomy is generated. The themes from the factor matrix analysis and from the hierarchical clustering analysis are compared. One of the key challenges in factor analysis has been defining the number of factors to select. The two most widely used factor number selection methods are the Kaiser criterion and the Scree test (1). 
The Kaiser criterion states that only factors with eigenvalues greater than unity should be retained, essentially requiring that a factor extracts at least as much variance as the equivalent of one original variable. The Scree test plots factor eigenvalue (variance) vs factor number, and recommends that only those factors that extract substantive variance be retained. Operationally, the factor selection termination point in the Scree test becomes the ‘elbow’ of the plot, the point where the slope changes from large to small. In this example, the location of the slope change point depended on the resolution level of the eigenvalue plot, and therefore had a fractal characteristic. In the example, once the desired value of the Scree Plot ‘elbow’ was determined, and the appropriate factor matrix was generated, the factor matrix was used as a filter to identify the significant technical words/phrases for further analysis. Specifically, the factor matrix complemented a basic trivial word list (e.g., a list containing words that are trivial in almost all contexts, such as ‘a’, ‘the’, ‘of’, ‘and’, ‘or’, etc) to select context-dependent high technical content words/phrases for input to a clustering algorithm. The factor matrix pre-filtering improved the cohesiveness of clustering by eliminating those words/phrases that are trivial words operationally in the application context. In the example, the Scree plot was used for factor number determination, since the Kaiser criterion yielded 224 factors. This number was far too large for detailed factor analysis, and of questionable utility, since many of the eigenvalues were not too different from unity. Factor matrices with different numbers of factors specified were computed. Eigenvalues were generated by Principal Components Analysis, and these eigenvalues represented the variance accounted for by each underlying factor. FIG. 1 shows the factor eigenvalue-factor number plot for the 659 un-rotated factors on a linear scale. 
The ‘elbow’, or break point, of the curve appeared to be about fourteen factors. To improve resolution, the curve was stretched in the x direction by halving the number of factors shown on one page. The curve had a similar shape to the 659 factor case, but the factor termination point appeared to decrease. The halving process was repeated until ten factors were plotted on one page, and the resolution effectively increased by an order of magnitude overall. FIG. 2 shows the ten factor plot. The elbow of the curve appeared to be about two factors. Thus, the number of factors selected based on significant slope change decreased from fourteen in the 659 factor plot to two in the ten factor plot. In fractal analysis, a fractal object has a number of characteristics. Among these are self-similarity (similar to itself at different magnifications), and adherence to a scaling relationship (the measured value of a property will depend on the resolution used to make the measurement). The Scree Plot had these two fractal properties. As the resolution increased, more structure appeared, and the value of the break point changed. The simplest and most common form of the scaling relationship is that of a power law. When such a power law is plotted on a log-log scale, the scaling relationship appears as a straight line. FIG. 3 is a plot of the break point on a linear scale, and FIG. 4 is a re-plot of FIG. 3 on a log-log scale. The log-log plot was approximately linear, reflected power law scaling, and validated the break point selection as a fractal process. 2) Factor Matrix Filtering The factor matrices determined by the various Scree Plots, ranging from two factor to fourteen factor, were examined. Only the results from the extremes, two and fourteen factor matrices, were examined. To diversify the factor loading patterns, and simplify interpretation of each factor, varimax orthogonal rotation was used. 
In the factor matrices used, the rows were the words and the columns were the factors. The matrix elements Mij were the factor loadings, or the contribution of word/phrase i to the theme of factor j. The theme of each factor was determined by those words that had the largest values of factor loading. Each factor had a positive value tail and negative value tail. For each factor, one of the tails dominated in terms of absolute value magnitude. This dominant tail was used to determine the central theme of each factor. Since each theme addressed some aspect of Raynaud's Phenomenon, an overview of Raynaud's Phenomenon will be presented before discussing the themes. Because the main Raynaud's terminology used in the literature was not consistent (in many cases, Raynaud's Disease was used interchangeably with Raynaud's Phenomenon or Raynaud's Syndrome), the overview will include the distinction among these Raynaud variants. Raynaud's Phenomenon Overview Raynaud's Phenomenon is a condition in which small arteries and arterioles, most commonly in the fingers and toes, go into spasm (contract) and cause the skin to turn pale (blanching) or a patchy red (rubor) to blue (cyanosis). While this sequence is normally precipitated by exposure to cold, and subsequent re-warming, it can also be induced by anxiety or stress. Blanching represents the ischemic (lack of adequate blood flow) phase, caused by digital artery vasospasm. Cyanosis results from de-oxygenated blood in capillaries and venules (small veins). Upon re-warming, a hyperemic phase ensues, causing the digits to appear red. Raynaud's Phenomenon can be a primary or secondary disorder. When the signs of Raynaud's Phenomenon appear alone without any apparent underlying medical condition, it is called Primary Raynaud's, or formerly, Raynaud's Disease. In this condition, the blood vessels return to normal after each episode. 
Conversely, when Raynaud's Phenomenon occurs in association with an underlying condition or is due to an identifiable cause, then it is referred to as Secondary Raynaud's, or formerly, as Raynaud's Syndrome. The most common underlying disorders associated with Secondary Raynaud's are the auto-immune disorders, or conditions in which a person produces antibodies against his or her own tissues. In contrast to Primary Raynaud's, where the blood vessels remain anatomically normal after each episode, in Secondary Raynaud's there may be scarring and long-term damage to the blood vessels; thus Secondary Raynaud's is potentially a more serious disorder than Primary. Certain repetitive activities may result in a predisposition to Raynaud's Phenomenon. These cases of so-called "Occupational Raynaud's" typically result from the chronic use of vibrating hand tools. Thus, while Raynaud's Phenomenon is a direct consequence of reduced blood flow due to reversible blood vessel constriction, it may be a function of many variables that can impact blood flow. These include:
For the fourteen factor matrix, the high factor loading words in the dominant tail of each factor are shown in parentheses after the factor number, followed by a brief narrative of the factor theme. Factor 1 (nuclear, antibodies, extractable, speckled, connective, immunofluorescence, antinuclear, tissue, anti-RNP, MCTD, mixed, ribonucleoprotein, swollen, RNP, antibody, antigen, titer, SLE, lupus, erythematosus) focused on different types of autoantibodies, especially anti-nuclear and extractable nuclear, and their relation to auto-immune diseases. Factor 2 (double-blind, placebo, mg, daily, weeks, times, agent, nifedipine, trial) focused on double-blind trials for vasodilators. Factor 3 (vibration, tools, workers, vibrating, exposure, chain, prevalence, time, exposed, sensory, white, circulatory, complaints) focused on the impact of vibratory tools on circulation. Factor 4 (coronary, ventricular, heart, angina, hypertension, myocardial, cardiac, failure, pulmonary) focused on coronary circulation and blood pressure problems. Factor 5 (prostaglandin, platelet, E1, prostacyclin, aggregation, infusion, hours, healing, ischaemic, thromboxane, administered, vasodilator, intravenous) focused on the administration of vasodilators to improve circulation. Factor 6 (calcinosis, sclerodactyly, esophageal, dysmotility, telangiectasia, anticentromere, variant, diffuse, scleroderma) focused on scleroderma-spectrum types of autoimmune diseases. Factor 7 (extremity, sympathectomy, artery, surgery, arteries, upper, occlusions, arterial, brachial, thoracic, operation, surgical, angiography, occlusive) focused on surgical solutions to remove constrictions on circulation. Factor 8 (C, degrees, systolic, pressure, cooling, blood, finger, measured, flow) focused on blood flow, and associated finger blood pressure and temperature measurements. Factor 9 (capillaries, capillary, nail-fold, microscopy, capillaroscopy) focused on the diagnostic use of nail-fold capillary microscopy. 
Factor 10 (training, biofeedback, relaxation, stress, outcome, measures, headaches, temperature, conducted, thermal, physiological, responses) focused on the use of biofeedback training to reduce stress headaches, and raise temperatures through improved circulation. Factor 11 (vasodilation, peripheral, immersion, calcium, water) focused on vasodilation of the peripheral circulatory system after immersion, and the role of calcium in this process. Factor 12 (complexes, immune, circulating, complement, IgG, serum, levels, IgM) focused on serum levels of circulating immune complexes and immunoglobulins, especially IgG and IgM. Factor 13 (eosinophilia, fasciitis, fascia, eosinophilic, visceral, hypergammaglobulinemia, absent, scleroderma-like, corticosteroids) focused on inflammation, especially of the fascia. Factor 14 (systemic, lupus, RA, erythematosus, PSS, sclerosis, rheumatoid, arthritis, SLE) focused on autoimmune diseases associated with Raynaud's Phenomenon. Two Factor Matrix Factor 1 (placebo, double-blind, mg, weeks, degrees, C, patients, attacks, measured, daily, P, crossover, trial, thromboxane, systolic, pressure, blood, temperature, agent, inhibitor, prostaglandin, nifedipine) had a circulation focus, specifically double-blind trials on coronary and peripheral circulation vasodilators. Factor 2 (antibodies, nuclear, antinuclear, connective, lupus, tissue, systemic, erythematosus, antibody, immunofluorescence, speckled, sera, SLE, extractable, antigen, arthritis, mixed, anti-RNP, rheumatoid, ribonucleoprotein, MCTD, CREST, serum, features, antigens) had an auto-immune focus, specifically the study of (mainly anti-nuclear) autoantibodies and their relation to inflammation-based auto-immune diseases. Thus, the two factor matrix showed the main thematic thrusts of circulation and auto-immunity (as were verified by the results of the clustering analysis). 
The fourteen factor matrix themes were divided into these two thrusts, where circulation covered factors 2, 3, 4, 5, 7, 8, 9, 10, and 11, and autoimmunity covered factors 1, 6, 12, 13, and 14. The factor themes from the fourteen factor matrix were more detailed, and to some degree represented the next sub-categorization of the themes from the two factor matrix. Factor Matrix Word Filtering and Selection Because of the greater specificity of the themes in the fourteen factor matrix, and the desire to have the capability to do multi-level hierarchical categorization in the clustering, the fourteen factor matrix was used for word filtering and selection. In the present experiment, the 659 words in the factor matrix had to be culled to the 250 allowed by the Excel-based clustering package, WINSTAT. The 250 word limit is an artifact of Excel. Other software packages may allow more or fewer words to be used for clustering, but all approaches perform culling to reduce dimensionality. The filtering process presented here was applicable to any level of filtered words desired. Another caveat: a trivial word list of the type described previously (words that are trivial in almost all contexts) was used to arrive at the 659 words used for the factor matrix input. This was not necessary. The raw words from the word generator could be used as input, and would be subject to the same filtering process. To allow more important words to be used in this demonstration, the very trivial words were removed. The factor loadings in the factor matrix were converted to absolute values. Then, a simple algorithm was used to automatically extract those high factor loading words at the tail of each factor. If word variants were on this list (e.g., singulars and plurals), and their factor loadings were reasonably close, they were conflated (e.g., ‘agent’ and ‘agents’ were conflated into ‘agents’, and their frequencies were added). See Example 2 for more detail about conflation. 
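The extraction of high factor loading words can be sketched in Python as follows. The text does not specify the "simple algorithm" that was used, so the rule below (keep the `top_n` words with the largest absolute loading on each factor) is an assumption made for illustration only. The function name, the parameter `top_n`, and the loading values are likewise hypothetical, and conflation of word variants is not shown.

```python
def filter_by_factor_loading(loadings, words, top_n):
    """Keep words whose absolute loading is among the top_n of any factor.

    loadings: list of rows, one per word; each row holds one loading
        per factor (the matrix elements Mij).
    words: list of words aligned with the rows of `loadings`.
    top_n: assumed cut-off per factor; the source gives no explicit rule.
    """
    n_factors = len(loadings[0])
    selected = set()
    for j in range(n_factors):
        # Rank word indices by absolute loading on factor j.
        ranked = sorted(range(len(words)),
                        key=lambda i: -abs(loadings[i][j]))
        selected.update(words[i] for i in ranked[:top_n])
    # Preserve the original word order in the output.
    return [w for w in words if w in selected]

# Hypothetical loadings for four words on two factors.
words = ["nifedipine", "placebo", "patients", "antibodies"]
loadings = [
    [0.80, 0.10],
    [-0.70, 0.00],
    [0.20, 0.05],
    [0.05, -0.90],
]
kept = filter_by_factor_loading(loadings, words, top_n=2)
```

Here "patients" is dropped because it has little influence on either factor, which mirrors the context-dependent elimination described in the text.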
A few words were eliminated manually, based on factor loading and estimate of technical content. An examination of the words eliminated and those retained showed that most of those retained appeared to have high technical content, and would have been selected by previous manual filtering processes for input to the clustering algorithms. Some of the words appeared not to have the highest technical content, also as shown above, but it was concluded that they were important because of their contribution to theme determination in the present clustering application. Similarly, some of the words eliminated by the factor matrix filter appeared to be high technical content, and in previous manual filtering processes might have been selected for the clustering algorithm input (e.g., acrocyanosis, vasomotor, cerebral, gastrointestinal). The conclusion for these words was not that they were unimportant per se. Rather, they did not have sufficient influence in determining the factor themes, and would not make an important contribution to the cluster structure determination. Thus, the context dependency (their influence on factor theme determination) of the words was the deciding factor in their selection or elimination, not only the judgment of their technical value independent of factor theme determination, as was done in previous manual filtering approaches. Word Clustering The 252 filtered and conflated words were input to the WINSTAT clustering algorithm, and the Average Link option was selected for clustering. A dendrogram was generated. This was a tree-like structure that showed how the individual words clustered into groups in a hierarchical structure. One axis was the words, and the other axis (‘distance’) reflected their similarity. The lower the value of ‘distance’ at which words, or word groups, were linked together, the closer their relation. 
As an extreme case of illustration, words that tended to appear as members of multi-word phrases, such as ‘lupus erythematosus’, ‘connective tissue’, or ‘double blind’ appeared adjacent on the dendrogram with very low values of ‘distance’ at their juncture. The top three hierarchical levels were determined, as follows: The top hierarchical level was divided into two major clusters. Cluster 1 focused on autoimmunity, and cluster 2 focused on circulation. The second hierarchical level was divided into four clusters, where cluster 1 was divided into clusters 1a and 1b, and cluster 2 was divided into clusters 2a and 2b. Cluster 1a focused on autoimmune diseases and antibodies, while cluster 1b focused on inflammation, especially fascial inflammation. Cluster 2a focused on peripheral vascular circulation, while cluster 2b focused on coronary vascular circulation. Most of the clusters in the second hierarchical level were divided into two sub-clusters, to produce the third hierarchical level clusters. Cluster 1a1 had multiple themes: different types of antibodies, especially anti-nuclear and extractable nuclear, and their relation to autoimmune diseases; sclerotic types of autoimmune diseases; and autoimmune diseases associated with Raynaud's Phenomenon. It incorporated the themes of factors 1, 6, and 14. Cluster 1a2 focused on circulating immune complexes, and paralleled the theme of factor 12. Cluster 1b was too small to subdivide further, and stopped at the second hierarchical level. It paralleled the theme of factor 13. Cluster 2a1 had multiple themes: double-blind clinical trials for vasodilators; administration of vasodilators to reduce platelet aggregation and improve circulation; blood flow, and associated finger blood pressure and temperature measurements; and occupational exposures, mainly vibrating tools and vinyl chloride, that impact the peripheral and central nervous systems and impact circulation. It incorporated the themes of factors 2, 3, 5, 7, 8. 
Cluster 2a2 focused on nailfold capillary microscopy as a diagnostic for micro-circulation, and paralleled the theme of factor 9. Cluster 2b1 focused on cardiovascular system problems, and paralleled the theme of factor 4. Cluster 2b2 focused on biofeedback training to reduce stress and headaches, and increase relaxation, and paralleled the theme of factor 10. In summary, factor matrix filtering proved to be an effective method for:
Selecting the number of factors for the factor matrix was complex, and the fractal nature of the Scree Plot had to be considered for final factor selection. Factor matrix filtering was used as a precursor for text element clustering. It eliminated words that had little influence on determining the factor themes, and that effectively served as trivial, or ‘noise’, words. It effectively pre-processed the raw text to eliminate the background clutter, and allowed the processed text to be used for any application where clutter removal is required. Example 2 Context-Dependent Conflation This example showed that word stemming in text processing was strongly context and application dependent, and that selection of word variants for stemming was context/application dependent. In addition, this example showed that the conflation filter rule proposed in (2) did not have a strong rational basis. A simple experiment was run, as part of a larger text mining study on the Fractals literature, to test the effect of word stemming on cluster theme definition. A Fractals-based query retrieved 4389 Science Citation Index records containing Abstracts, covering the period 2001-October 2002. All the single Abstract words were extracted, and the highest frequency highest technical content words (820) were selected for word clustering. A two step clustering process was used, where a factor matrix was generated initially with no word combination required, then a hierarchical clustering was performed using word combinations based on the factor matrix results. The factor matrix generator in the TechOasis software package used a correlation matrix of the uncombined 820 words as input. The generator produced a 29 factor matrix (820×29), where each factor represented a theme of the Fractals database. The value of each matrix element Mij was the factor loading, the contribution of word i to factor j. For the analysis of each factor, the factor column was sorted in descending numerical order. 
Each factor had two tails, one with large positive value and one with large negative value. The tails were not of the same absolute value size; one of the tails was always dominant. The theme of each factor was determined by the highest absolute value terms in the dominant tail. For purposes of this example, the interchangeability of the singular and plural variants only was reported and discussed, although the results of interchangeability of all the word variants in the 820 word list were used to determine the word combinations input to the hierarchical clustering algorithm. All words were examined that had both singular and plural forms represented in the 820 words, especially where at least one of the variants was contained in the dominant tail of a factor and thereby was influential in determining the theme of the factor. Singular and plural forms that could be conflated credibly were interchangeable. They were located in close proximity in the dominant tail (similar factor loadings), and had similar influence in determining the cluster theme. Otherwise, they were being used in different contexts, and their conflation had the effect of artificially merging themes or clusters to produce erroneous groupings. One benchmark for how well the factor matrix algorithm spotted interchangeability was its numerical performance with multi-word phrases. In the Fractals literature, there were multi-word phrases that appeared frequently, where each word in the multi-word phrase was either exclusive to the phrase, or used frequently in the phrase. Examples are: Atomic Force Microscopy and its acronym AFM, Scanning Electron Microscopy and its acronym SEM, Thin Film, Fractional Brownian Motion and its acronym FBM, and Monte Carlo. The component words of these strong multi-word phrases appeared close to each other in the dominant tail, when the clustering was viewing them as a unit. 
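The dominant-tail reading of a factor column can be sketched as follows. This is an illustrative sketch, not the TechOasis implementation: the function name and the 0.20 cutoff are assumptions (the cutoff mirrors the approximate high-loading threshold quoted in this example), and the four loadings are the factor 6 values reported in this example for electron, electrons, force, and forces.

```python
def dominant_tail(loadings, threshold=0.20):
    """Find the dominant tail of one factor column.

    loadings: dict mapping word -> factor loading for a single factor.
    Returns (sign, tail): sign is +1 or -1 for the dominant tail, and tail
    lists the (word, loading) pairs in that tail whose absolute loading
    meets the threshold, strongest first.
    """
    # Sort the factor column in descending numerical order.
    ranked = sorted(loadings.items(), key=lambda kv: kv[1], reverse=True)
    max_pos = max((v for _, v in ranked if v > 0), default=0.0)
    max_neg = max((-v for _, v in ranked if v < 0), default=0.0)
    # The dominant tail is the one with the larger absolute loadings.
    sign = 1 if max_pos >= max_neg else -1
    tail = [(w, v) for w, v in ranked if sign * v >= threshold]
    tail.sort(key=lambda kv: abs(kv[1]), reverse=True)
    return sign, tail


# Factor 6 loadings as reported in this example.
FACTOR_6 = {"electron": -0.40, "electrons": -0.02, "force": -0.52, "forces": 0.01}
SIGN, TAIL = dominant_tail(FACTOR_6)
print(SIGN, TAIL)
```

For these values the negative tail dominates, and only the words with large absolute loadings remain as theme-determining terms.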
The dominant factor tails that included the multi-word phrases above, and the word factor loadings (in parentheses), were as follows.
The threshold absolute value for high factor loading across all factors was about 0.20, and the highest absolute value for factor loading across all factors was about 0.70. All the words above were well above the threshold and at or near the end of the dominant tail in their respective factor. All the multi-word phrase components had high factor loadings in close proximity, with words relatively unique to the multi-word phrase being in very close proximity. The performance of singular and plural variants was then examined. There was a continuum of relative values between the singular and plural variants, and only the extremes were used to illustrate the main points. Singular/plural variants had a high absolute value factor loading in one factor only. Low value factor loadings did not determine the factor theme. However, it was clear that variants closely related in their dominant tail appearance also tended to be closely related in most of their appearances in other factors. Variants not closely related in their dominant tail appearance tended not to be closely related in appearances in other factors. Sample closely-related singular-plural variants, accompanied by their factor loadings/factors in parentheses, were as follows: avalanche (0.453/10), avalanches (0.502/10); earthquake (0.599/17), earthquakes (0.541/17); gel (0.539/18), gels (0.495/18); island (0.42/24), islands (0.38/24); network (0.49/21), networks (0.45/21). Sample disparately-related singular-plural variants included: angle (0.31/23), angles (0.08/23); control (-0.25/21), controls (-0.01/21); electron (-0.40/6), electrons (-0.02/6); force (-0.52/6), forces (0.01/6); state (-0.26/10), states (-0.01/10). Thus, the closely-related singular-plural variants had similar high factor loadings, and were conflated with minimal impact on the clustering results, since they were acting interchangeably in the clustering context.
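The loadings listed above can be run through a simple conflation check. The rule below is an assumption made for illustration, not a rule stated in this example: a singular/plural pair is treated as interchangeable when both loadings in the shared factor have the same sign and both meet the approximate 0.20 high-loading threshold. The loading values are the ones reported above.

```python
THRESHOLD = 0.20  # approximate high-loading cutoff quoted in this example

# (singular, plural): (singular loading, plural loading) in the shared factor.
PAIRS = {
    ("avalanche", "avalanches"): (0.453, 0.502),
    ("earthquake", "earthquakes"): (0.599, 0.541),
    ("gel", "gels"): (0.539, 0.495),
    ("island", "islands"): (0.42, 0.38),
    ("network", "networks"): (0.49, 0.45),
    ("angle", "angles"): (0.31, 0.08),
    ("control", "controls"): (-0.25, -0.01),
    ("electron", "electrons"): (-0.40, -0.02),
    ("force", "forces"): (-0.52, 0.01),
    ("state", "states"): (-0.26, -0.01),
}


def should_conflate(a, b, threshold=THRESHOLD):
    """Assumed rule: same sign, and both loadings at or above the threshold."""
    return a * b > 0 and min(abs(a), abs(b)) >= threshold


# Singular forms whose plural variant would be merged under the assumed rule.
CONFLATED = sorted(
    singular for (singular, _), (a, b) in PAIRS.items() if should_conflate(a, b)
)
print(CONFLATED)
```

Under this assumed rule, the five closely-related pairs are merged and the five disparately-related pairs are kept separate, matching the grouping reported above.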
The disparately-related singular-plural variants had one high and one low factor loading, and were not conflated, since they were operationally different concepts with similar superficial appearance. It should be strongly emphasized that the metric used for conflation justification was interchangeability, not co-occurrence of the variants in the same document, as proposed by (2). While intra-document co-occurrence may have been operable under some scenarios, there was no a priori reason that it should have been stated as a condition, metric, or requirement. One could have easily envisioned a corpus where singular-plural variants never co-occurred in the same document, yet behaved interchangeably (or didn't behave interchangeably). For example, a corpus of small documents, such as Titles or Abstracts, might not have contained word variants in the same document, but could have contained word variants behaving interchangeably even though they were in different documents. The condition to require was that the variants should have correlated or co-occurred similarly with other words in the corpus for the purpose of the application context. Thus, their variant was transparent from the perspective of the other words in the specific context of the application. Reference (2) would have had a much more credible condition had the metric been co-occurrence similarity of each word variant with other (non-variant) words in the text, rather than high co-occurrence with other forms of the variant. Once the conflation-justified variants were identified by the factor matrix filter, they were then combined to lower the dimensionality of the system, and used to generate a co-occurrence matrix. This 250 word square matrix was imported into an Excel statistical package add-in named WINSTAT (Excel has an approximate 250 column limitation), and used as the basis for a multi-link clustering algorithm. In summary, credible conflation was shown to be context and application sensitive. 
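The distinction between variant co-occurrence and interchangeability can be made concrete with a small sketch. Everything below is hypothetical: the six short "documents" are invented titles, and cosine similarity of co-occurrence profiles is one plausible way, assumed here for illustration, to measure whether two variants co-occur similarly with the other words in a corpus.

```python
from collections import Counter
from math import sqrt


def cooccurrence_profile(word, docs, exclude=()):
    """Count the other words appearing in the same documents as `word`."""
    profile = Counter()
    for doc in docs:
        tokens = set(doc.lower().split())
        if word in tokens:
            profile.update(tokens - {word} - set(exclude))
    return profile


def cosine(p, q):
    """Cosine similarity between two co-occurrence profiles."""
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Hypothetical corpus of short, title-like documents.
DOCS = [
    "fractal network growth model",
    "fractal networks growth model",
    "scaling network topology",
    "scaling networks topology",
    "atomic force microscopy surface",
    "tidal forces fractal basin",
]


def similarity(singular, plural):
    pair = (singular, plural)
    return cosine(
        cooccurrence_profile(singular, DOCS, exclude=pair),
        cooccurrence_profile(plural, DOCS, exclude=pair),
    )


SIM_NETWORK = similarity("network", "networks")
SIM_FORCE = similarity("force", "forces")
print(round(SIM_NETWORK, 3), round(SIM_FORCE, 3))
```

In this toy corpus neither pair of variants ever appears in the same document, yet the first pair shares an identical co-occurrence profile while the second pair shares none, so a same-document co-occurrence test could not separate the two cases.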
The metric for determining conflation credibility should have been driven by the context and application. For the clustering application described in this example, correlation-driven interchangeability was the appropriate metric, rather than the variant co-occurrence-based metric proposed in (2). Example 3 Information Retrieval/Marginal Utility/Tracking This example describes an iterative full-text information retrieval approach based on relevance feedback with term co-occurrence and query expansion (Simulated Nucleation). The method generated search terms from the language and context of the text authors, and was sufficiently flexible to apply to a variety of databases. It provided improvement to the search strategy and related results as the search progressed, adding relevant records to the information retrieved and subtracting non-relevant records as well. Finally, it allowed maximum retrieval of relevant records with high signal-to-noise ratio by tracking marginal utility of candidate query modification terms using a semi-automated optimization procedure. The method was applied to information retrieval for the technical discipline of textual data mining (TDM). In Simulated Nucleation for information retrieval, the purpose was to provide a tailored database of retrieved documents that contained all relevant documents from the larger literature. In the initial step of Simulated Nucleation, a small core group of documents mainly relevant to the topic of interest was identified by the topical domain experts. An inherent assumption was then made that the bibliometric and phrase patterns and phrase combinations characteristic of this relevant core group would be found to occur in other relevant documents. These bibliometric and phrase patterns and phrase combinations were then used to expand the search query.
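The idea of mining a relevant core group for phrase patterns can be sketched in a few lines. This is only an illustration under stated assumptions: phrases are approximated as adjacent word pairs, the `min_docs` cutoff is hypothetical, and the three core "documents" are invented. It is not the phrase extractor used in the study.

```python
from collections import Counter


def bigrams(text):
    """Return the set of adjacent two-word phrases in a document."""
    tokens = text.lower().split()
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def candidate_phrases(core_docs, query_terms, min_docs=2):
    """Phrases found in at least `min_docs` core documents, not yet in the query."""
    doc_freq = Counter()
    for doc in core_docs:
        doc_freq.update(bigrams(doc))  # one count per document
    known = {t.lower() for t in query_terms}
    return sorted(
        (p for p, n in doc_freq.items() if n >= min_docs and p not in known),
        key=lambda p: (-doc_freq[p], p),
    )


# Hypothetical core group judged relevant by domain experts.
CORE = [
    "text mining of technical abstracts using phrase frequency",
    "information retrieval and text mining with relevance feedback",
    "relevance feedback improves information retrieval",
]
CANDIDATES = candidate_phrases(CORE, ["text mining"])
print(CANDIDATES)
```

The phrases returned are candidates only; in the approach described here an expert would still judge which of them belong in the expanded query.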
While both bibliometrics and computational linguistics were used in Simulated Nucleation to identify unique characteristics of each category, the bulk of the development effort has concentrated on the computational linguistics. Therefore, the bulk of the remainder of this example will address the computational linguistics. There were two major Simulated Nucleation approaches for expanding the number of relevant documents and contracting the number of non-relevant documents. The first was a manually intensive approach that required the reading of many sample Abstracts to separate the relevant from non-relevant documents, and then identified candidate query terms from computational linguistics analysis of each document category. The second was a semi-automated approach that used computer-based document clustering techniques for separating the relevant from non-relevant records, but still required manual identification of candidate query terms from computational linguistics analysis of each separate document category. Since the first approach provided somewhat more accurate results, albeit requiring substantially more time and labor, it will be the only approach described in detail. The operational objective of Simulated Nucleation was to generate a query that had the following characteristics:
To achieve these objectives, the Simulated Nucleation process contained the following steps:
Each of these steps will now be described in more detail. The process began with a definition of the scope of the study by all participants. Within the context of this scope, an initial query was constructed. (Since each iterative step follows the same procedure, only one iterative step from the study of TDM will be described.) Queries were scope dependent. Typically, when a new scope was defined, a new query was developed. However, due to the iterative nature of Simulated Nucleation, when the scope became more focused within the overall topical domain as the study proceeded, the new scope was accommodated within succeeding iterations. Such a scope sharpening did occur during the course of the illustrative TDM example, and the accommodation of the new scope within the iterative process will be summarized now. For the TDM study example, the initial TDM scope was defined as retrieving records related to textual data mining in the larger context; i.e., including information retrieval. As the study proceeded, the scope was restricted to documents that focused on understanding and enhancing the quality of the TDM process, as opposed to using standard TDM approaches to perform specific studies. The next step in the Simulated Nucleation process was generation of a query development strategy. Past experience with Simulated Nucleation has shown that the structure and complexity of a query were highly dependent on:
These query dependencies were taken into account when structuring the initial query. Different initial queries eventually evolved to similar final queries through the iterative process. However, higher quality initial queries resulted in a more streamlined and efficient iterative process. Specifically, one of the key findings from ongoing text mining studies was that, in general, a separate query had to be developed for each database examined. Each database accessed a particular culture, with its unique language and unique types of documentation and expression. A query that optimized (retrieved large numbers of desirable records with high signal-to-noise ratio) for one database within the context of the study objectives was sometimes inadequate for another database. For example, a text mining study published in 2000 focused on the R&D of the aircraft platform. The query philosophy was to start with the generic term AIRCRAFT, then add terms that would expand the numbers of aircraft R&D records (mainly journal paper Abstracts) retrieved and would eliminate records not relevant to aircraft R&D. Two databases were queried, the Science Citation Index (SCI—a database accessing basic research records) and the Engineering Compendex (EC—a database accessing applied research and technology records). The SCI query required 207 terms and three iterations for an acceptable signal-to-noise ratio, while the EC query required 13 terms and one iteration to produce an even better signal-to-noise ratio. Because of the technology focus of the EC, most of the records retrieved using an aircraft or helicopter type query term focused on the R&D of the aircraft platform itself, and were aligned with the study goals. Because of the research focus of the SCI, many of the records retrieved focused on the science that could be performed from the aircraft platform, rather than the R&D of the aircraft platform, and were not aligned with the study goals. 
Therefore, no adjustments were required to the EC query, whereas many negation terms (NOT Boolean terms) were required for the SCI query to eliminate aircraft records not aligned with the main study objectives. In TDM, queries, as well as follow-on computational linguistics analyses, sometimes provided misleading results if applied to one database field only. The text fields (Keywords, Titles, Abstracts) were used by their originators for different purposes, and the query and other computational linguistics results sometimes provided a different picture of the overall discipline studied based on which field was examined. As an example, in the aircraft study referenced previously, queries were applied to all text fields (Keywords, Titles, Abstracts) simultaneously. However, follow-on phrase frequency analyses for TDM were performed on multiple database fields to gain different perspectives. A high frequency Keyword focal area concentrated on the mature technology issues of longevity and maintenance; this view of the aircraft literature was not evident from the high frequency Abstract phrases. The lower frequency Abstract phrases had to be accessed to identify thrusts in this mature technology/longevity/maintenance area. Also, the Abstract phrases from the aircraft study contained heavy emphasis on laboratory and flight test phenomena, whereas there was a noticeable absence of any test facilities and testing phenomena in the Keywords. There was also emphasis on high performance in the Abstract phrases, a category conspicuously absent from the Keywords. In fact, the presence of mature technology and longevity descriptors in the Keywords, coupled with the absence of high performance descriptors, provided a very different picture of aircraft literature research from the presence of high performance descriptors in the Abstract phrases, coupled with the absence of mature technology and longevity/maintenance descriptors. 
The TDM analytical procedure in which Simulated Nucleation was imbedded and the query construction were not independent of the analyst's domain knowledge; they were, in fact, expert-centric. The computer techniques played a strong supporting role, but they were subservient to the expert, and not vice versa. The computer-derived results helped guide and structure the expert's analytical processes; the computer output provided a framework upon which the expert constructed a comprehensive story. The final query and study conclusions, however, reflected the biases and limitations of the expert(s). Thus, a fully credible query and overall analysis required not only domain knowledge by the analyst(s), but probably domain knowledge representing diverse backgrounds (i.e., multiple experts). It was also found useful in past and ongoing text mining studies to incorporate a generalist with substantial experience in constructing queries and analyzing different technical domains. This person identified efficient query terms and unique patterns for that technical domain not evident to the more narrowly focused domain experts. Constructing an R&D database query that will retrieve sufficient technical documents to be of operational use was not a simple procedure. It required:
There were two generic types of query construction philosophy that have been used with Simulated Nucleation. One philosophy started with relatively broad terms, and built the query iteratively. Many of the additional terms were non-relevant to the scope of the study due to the multiple meanings the more general terms may be assigned. Some query modification procedure was required to eliminate non-relevant records. For example, in the aircraft R&D study, this general approach was used. The query started with AIRCRAFT, and then was modified to remove terms that would result in retrieving aircraft records not related to the R&D of the aircraft platform. While the emphasis of these later iterations was reduction of non-relevant records, there were terms added to the query that retrieved new records. The other philosophy started with relatively specific terms, and built the query iteratively as well. Most of the additional query terms retrieved relevant records. Because of the specificity of the query terms, records relating to the more general theme and scope of the study were, in some cases, overlooked. Also, within both philosophies, if multiple iterations were used, the focus was different for each iterative step in the temporal sequence. The earlier iterations emphasized adding query terms to expand the number of relevant records retrieved, while the later iterations emphasized modifying the query to reduce the number of non-relevant records retrieved. Each iteration allowed new related literatures to be accessed, and additional relevant records to be retrieved. However, additional time and money were required for each added iteration, because of the intense analysis required per iteration. In practice, the two main limiting parameters to the length of a study were number of iterations and resources available. Two practical cases of interest were addressed. The first case resulted from severe resource constraints. 
In this case, the objective was to minimize the number of iterations required to develop the query subject to a threshold signal-to-noise ratio on retrieved records. The strategy for a single iteration query was to generate a test query (initial guess), categorize the retrieved records into relevant and non-relevant bins, apply computational linguistics to each bin, and select only those phrases and phrase combinations that are strongly characteristic of the relevant bin for the modified query. The ratio for phrase selection cutoff was determined by the marginal utility of each phrase as a query term. The resulting records retrieved with this modified query had very high signal-to-noise ratio, as confirmed by sampling a few records retrieved with this modified query. However, their coverage was limited. The more generic terms that could have retrieved additional relevant records (along with some non-relevant records) were not employed. The second case resulted from relaxed resource constraints. In this case, the objective was to maximize the number of records retrieved subject to a threshold signal-to-noise ratio. The general strategy for multiple iteration query development was to focus the initial iterations on expanding the number of relevant records retrieved, including the addition of non-relevant records, and then devote the last iteration mainly to eliminating the non-relevant records. A two iteration query development was used to illuminate the concept. The strategy for the first iteration of a two iteration signal maximization query was to generate a test query (initial guess), categorize the retrieved records into relevant and non-relevant bins, apply computational linguistics to each bin, and select only those phrases and phrase combinations that were moderately to strongly characteristic of the relevant bin for the modified query. The resulting records retrieved with this modified query had a modest signal-to-noise ratio. 
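The bin-based phrase selection described above can be sketched as follows. This is a simplified illustration, not the study's procedure: the `min_ratio` cutoff of 3.0 stands in for the marginal-utility-based cutoff and is a hypothetical value, phrase matching is a plain substring test, and the records in both bins are invented.

```python
def bin_counts(phrase, relevant, non_relevant):
    """Number of records in each bin that contain the phrase."""
    rel = sum(phrase in doc.lower() for doc in relevant)
    non = sum(phrase in doc.lower() for doc in non_relevant)
    return rel, non


def select_terms(candidates, relevant, non_relevant, min_ratio=3.0):
    """Split candidates into terms to add and terms to negate (NOT terms)."""
    add, negate = [], []
    for phrase in candidates:
        rel, non = bin_counts(phrase, relevant, non_relevant)
        if rel > 0 and rel >= min_ratio * non:
            add.append(phrase)      # strongly characteristic of the relevant bin
        elif non > 0 and non >= min_ratio * rel:
            negate.append(phrase)   # strongly characteristic of the non-relevant bin
    return add, negate


# Hypothetical records, already categorized by a reader.
RELEVANT = [
    "text mining of patents",
    "text mining for discovery",
    "data mining and text mining",
    "text mining query expansion",
]
NON_RELEVANT = [
    "coal mining safety",
    "gold mining equipment",
    "data mining of coal mining records",
]
ADD, NEGATE = select_terms(
    ["text mining", "data mining", "coal mining"], RELEVANT, NON_RELEVANT
)
print(ADD, NEGATE)
```

A phrase that appears about equally in both bins is neither added nor negated, which is the kind of more generic term whose inclusion trades coverage against signal-to-noise ratio.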
However, their coverage was expanded relative to the previous (single iteration) case. The more generic terms that could retrieve additional relevant records (along with some non-relevant records) were employed. The strategy for the second iteration of the two iteration signal maximization query was to use the modified query generated from the first iteration as a starting point, and categorize the retrieved records into relevant and non-relevant bins. Then, computational linguistics was applied to each bin, and those phrases and phrase combinations that were strongly characteristic of the non-relevant bin for the modified query were selected. Since new phrases resulted from the expanded relevant records retrieved by the modified first iteration query, some phrases and phrase combinations that were very strongly characteristic of the relevant bin were also added. Again, the threshold ratio for phrase selection cutoff was determined by the marginal utility of each phrase as a query term. Then, these mainly negation phrases were added to the second iteration starting point query to produce the final modified query. The resulting records retrieved with this final modified query had a very high signal-to-noise ratio, as confirmed by sampling relatively few records retrieved by this final query, and their coverage was expanded relative to the previous case. In the truly resource unlimited case where the number of iterations was relatively unbounded, the following approach was taken. The number of relevant records after each iteration was plotted as a function o
