Ranking of query feedback terms in an information retrieval system
6363378
Abstract
An information retrieval system processes user input queries, and identifies query feedback, including ranking the query feedback, to facilitate the user in re-formatting a new query. A knowledge base, which comprises a plurality of nodes depicting terminological concepts, is arranged to reflect conceptual proximity among the nodes. The information retrieval system processes the queries, identifies topics related to the query as well as query feedback terms, and then links both the topics and feedback terms to nodes of the knowledge base with corresponding terminological concepts. At least one focal node is selected from the knowledge base based on the topics to determine a conceptual proximity between the focal node and the query feedback nodes. The query feedback terms are ranked based on conceptual proximity to the focal node. A content processing system that identifies themes from a corpus of documents for use in query feedback processing is also disclosed.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE 1
Document Theme Vector
Document Themes    Theme Strength    Classification Category
Theme.sub.1        190               (Category.sub.a)
Theme.sub.2        110               None
Theme.sub.3        70                (Category.sub.c)
Theme.sub.4        27                (Category.sub.d)
. . .              . . .             . . .
Theme.sub.n        8                 (Category.sub.z)
As shown in Table 1, a document theme vector 160 for a document includes a list of document themes, indicated in Table 1 by Theme.sub.1 -Theme.sub.n. Each theme has a corresponding theme strength. The theme strength is calculated in the theme vector processor 750 (FIG. 7). The theme strength is a relative measure of the importance of the theme to the overall content of the document. For this embodiment, the larger the theme strength, the more important the theme is to the overall content of the document. The document theme vector 160 lists the document themes from the most important to the least important themes (e.g., theme.sub.1 -theme.sub.n). The document theme vector 160 for each document further includes, for some themes, a category for which the theme is classified. The classification category is listed in the third column of the document theme vector shown in Table 1. For example, theme.sub.1 is classified in category.sub.a, and theme.sub.3 is classified in category.sub.c.
The information retrieval system 100 also utilizes a knowledge base, labeled 155 on FIG. 1. In general, the knowledge base 155 includes a plurality of nodes of categories, which represent concepts, expressed as terminology and arranged to reflect semantic, linguistic and usage associations among the categories. In one embodiment, the knowledge base 155 may contain classification and contextual information based on processing and/or compilation of thousands of documents or may contain information based on manual selection by a linguist. The contents, generation and use of the knowledge base 155 are described more fully below in the section "The Knowledge Base."
As shown in FIG. 1, the information retrieval system 100 includes query feedback processing 185. The query feedback processing 185 receives, as input, the document hit list and query feedback terms and generates, as output, ranked query feedback terms for display on the user's output display (e.g., computer monitor).
To generate the ranked query feedback, the query feedback processing 185 accesses the document theme vectors 160, cluster analysis 190, and knowledge base 155. The cluster analysis 190 receives, as input, themes from the query feedback processing 185 and generates, as output, focal categories or nodes. In general, themes, which include a plurality of theme weights, are input to cluster analysis 190. In response, the cluster analysis 190 generates "focal categories" that reflect categories of the knowledge base most representative of the themes as depicted in the organization of categories in the knowledge base 155. The query feedback processing 185 uses the focal categories to measure conceptual proximity between the focal categories and the query feedback terms. This conceptual proximity measure is used to rank the query feedback (e.g., themes). The query feedback processing 185 outputs retrieval information, including the ranked query feedback terminology, to a screen module (not shown). In general, the screen module processes the ranked query feedback terms to display the terms in a predetermined form. A screen module, which processes information for display on a computer output display, is well known in the art and will not be described further.
One embodiment for generating query feedback terms for use in query feedback processing 185 is described in U.S. Pat. No. 6,094,652, issued on Jul. 25, 2000, entitled "Hierarchical Query Feedback in An Information Retrieval System," filed Jun. 10, 1998, inventor: Mohammad Faisal, which is expressly incorporated herein by reference. One embodiment for generating document hit lists is described in U.S. patent application Ser. No. 08/861,961, (pending) entitled "A Document Knowledge Base Search and Retrieval System," filed May 21, 1997, inventor: Kelly Wical, which is expressly incorporated herein by reference.
Embodiments for Ranking Query Feedback Terms
FIG.
2 is a flow diagram illustrating one embodiment of query feedback processing in accordance with the present invention. The process is initiated by a user entering an input user query into the system (e.g., typing the query into a computer system). The information retrieval system 100 processes the query by identifying a "document hit list" and query feedback terms (query processing 175, FIG. 1), as shown in block 202. The document hit list defines a plurality of topics which the information retrieval system determines are related to the input user query. In a document-based system, the document hit list identifies multiple documents, wherein each document includes one or more themes or topics. As shown in block 212, themes or topics for the document hit list are identified. For the information retrieval system 100 of FIG. 1, the query feedback processing 185 accesses the document theme vector 160 to extract the topics or themes for each document in the document hit list. In one embodiment, the document theme vector 160 identifies the sixteen (16) most important themes for each document in the document hit list. The themes are mapped or linked to categories of a knowledge base, as shown in block 222. The knowledge base 155 consists of a plurality of categories, arranged hierarchically, wherein each category represents a concept. For example, if one of the themes is "computer industry," then the theme "computer industry" is linked to the category "computer industry" in the knowledge base 155. Due to the normalization processing performed as part of the content processing (FIG. 7), the words/phrases for the themes are in the same form as words/phrases of the categories in the knowledge base. To identify a focal category, the information retrieval system 100 identifies one or more clusters of themes, as they are linked in the knowledge base 155, as shown in block 232 of FIG. 2. For the information retrieval system 100 shown in FIG.
1, the "clustering" of themes or topics is performed by the cluster analysis 190. The arrangement of categories or concepts in the knowledge base represents semantic, linguistic or usage relationships among the concepts of an ontology. Thus, when the themes are mapped to categories of the knowledge base, the categories identified reflect a semantic, linguistic or usage association among the themes. These associations from the knowledge base are utilized to rank query feedback terminology. Although a knowledge base with hierarchically arranged categories is described for use with query feedback processing of the present invention, any information base that reflects associations among concepts or words may be used without deviating from the spirit or scope of the invention. As shown in block 232, one or more clusters of themes are identified. In general, a cluster of themes is a plurality of themes located in relatively close conceptual proximity when linked in the knowledge base. The identification of theme clusters for use with ranking query feedback processing is described more fully below in conjunction with FIGS. 3, 4 and 5. As shown in block 242 of FIG. 2, a focal category is selected from each cluster. In general, a focal category or focal topic is a concept that best reflects the center of conceptual proximity of the themes or topics in a cluster. Accordingly, the focal category or topic is a single concept that best represents the thematic concepts in the cluster (i.e., themes). As shown in block 246, the query feedback terms are mapped to categories of the knowledge base. For the embodiment that generates query feedback terms as described in U.S. Pat. No. 6,094,652, issued on Jul. 25, 2000, the query feedback terms are generated based on categories of the knowledge base. Alternatively, for other embodiments, the query feedback terms may undergo normalization processing 120 (FIG.
7) so that the form of the query feedback term matches the form of the category names in the knowledge base. As shown in block 252 of FIG. 2, semantic proximity between focal categories and query feedback terms is calculated. The knowledge base 155 is organized to reflect semantic distances through levels of the hierarchy, as well as cross-references. In one embodiment, a quantitative value is ascribed to reflect a semantic distance between levels of the hierarchy and cross-references among categories. For example, the distance between a parent category and a child category in an ontology may represent a semantic distance of 1. For this embodiment, if a query feedback term is a grandchild category of the focal category, then that term receives a semantic distance of 2. One embodiment for calculating semantic proximity is described more fully below in the section "Semantic Proximity." As shown in block 262, the query feedback terms are ranked by the semantic proximity. As shown in block 272 of FIG. 2, the query feedback terms are displayed (e.g., on a computer display) in the order in which they were ranked.
Clustering Processing
FIG. 3 illustrates an example of mapping themes to a portion of a knowledge base. Document themes 160 for a document hit list contain themes (theme.sub.1, theme.sub.2, . . . theme.sub.8). Each theme includes a name to identify that theme. For purposes of explanation, the themes of document set 200 are designated A, B, C, F, H', K, C' and C". In addition, a theme includes a weight. In general, the theme weight provides a quantitative relative measure for the associated theme. For the example of FIG. 3, theme.sub.1 has a weight of 10; theme.sub.2 has a weight of 10; and theme.sub.3 has a weight of 10, etc. An example portion of a knowledge base is reproduced in FIG. 3, and labeled 210. Knowledge base 210 includes two independent ontologies, labeled 220 and 230.
The ontology 220 contains node.sub.A as the highest level node in the hierarchy, and includes six levels total. Ontology 230, which includes five levels, contains node.sub.G, in the highest level, and contains node.sub.K in the lowest level. As described more fully below, the nodes represent terms or concepts, and the hierarchical relationships among the concepts depict semantic, linguistic, and usage associations among those concepts. For purposes of nomenclature, the knowledge base 210 includes both ancestor relationships and descendant relationships. An ancestral relationship includes all of those nodes linked to the subject node that appear higher in the hierarchy. For example, node.sub.D has an ancestral relationship to node.sub.C', node.sub.B, and node.sub.A. Descendant relationships for a subject node include all nodes beneath the subject node in a tree hierarchy. For example, node.sub.C' has a descendant relationship with node.sub.D, node.sub.E', node.sub.E, and node.sub.F. Also, nodes may be defined as having parent-child, grandparent-child, grandchild-child, etc. relationships. For example, node.sub.I is the grandchild node of node.sub.G in ontology 230. FIG. 4 is a flow diagram illustrating one embodiment for weight summing for use in the clustering analysis 190. As shown in block 310, themes of the document themes are mapped to nodes of the tree structure. Table 2 shows, for each cluster node (i.e., a node for which a theme is mapped), accumulated weights for the corresponding cluster nodes for the example document themes 200 of FIG. 3.
TABLE 2
Theme Name Value (Weight)
A 10
C 20
D 10
F 50
H' 30
K 10
C' 10
C" 15
To map themes of a document to a tree structure, theme names are compared with node names to identify matches. For the example of FIG. 3, theme.sub.1, named "A", is mapped to or associated with node.sub.A in ontology 220. Similarly, the theme named "H'" is mapped to ontology 230 by associating that name with node.sub.H. As shown in block 320 of FIG. 4, raw weights are generated for nodes from the weight values of the corresponding themes. This step includes ascribing the weights for each theme as a raw weight for the corresponding node. If the set of document themes contains two or more themes having the same name, then the raw weight is a sum of all of those theme weights. Table 2 shows, for each theme, the theme name, as mapped to the knowledge base, and a column for the corresponding weight for the example of FIG. 3. In addition to accumulating raw weights, the cluster analysis 190 (FIG. 1) also calculates a descendant weight. As shown in block 330 of FIG. 4, the weights of all descendant nodes of each selected node are summed to generate a descendant weight. Table 3 shows, for each theme of example theme set 200, a descendant weight. Specifically, the descendant weight is determined by adding all of the raw weights for all of the child nodes as well as the raw weights for the remaining descendants of the subject node. For example, a descendant weight of one hundred and five (105) for node.sub.A is calculated by summing weights for all of its descendants (i.e., 20+10+15+10+50 from node.sub.C, node.sub.C', node.sub.C", node.sub.D, and node.sub.F, respectively). Table 3 shows, for each cluster node, the name associated with the node and its descendant weight.
TABLE 3
Node Name Descendant Weight
C' 60
C" 0
A 105
C 0
D 50
F 0
H 10
K 0
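The raw-weight and descendant-weight steps (blocks 320 and 330) can be sketched as follows. This is a minimal illustration, not the patented implementation: FIG. 3 is not reproduced in the text, so the tree shape in `CHILDREN` is a simplified assumption covering only ontology 220, with theme-less intermediate nodes omitted. The names `CHILDREN`, `THEMES`, `raw_weights` and `descendant_weight` are hypothetical.

```python
# Assumed, simplified shape of ontology 220 (intermediate nodes without
# themes are left out because the figure is not available in the text).
CHILDREN = {
    "A": ["C", "C'", 'C"'],
    "C'": ["D"],
    "D": ["F"],
}

# Theme names and weights taken from Table 2 (ontology 220 entries only).
THEMES = [("A", 10), ("C", 20), ("D", 10), ("F", 50), ("C'", 10), ('C"', 15)]


def raw_weights(themes):
    """Sum the weights of all themes that share a name (block 320)."""
    raw = {}
    for name, weight in themes:
        raw[name] = raw.get(name, 0) + weight
    return raw


def descendant_weight(node, raw, children):
    """Add up the raw weights of every node below `node` (block 330)."""
    total = 0
    for child in children.get(node, []):
        total += raw.get(child, 0) + descendant_weight(child, raw, children)
    return total


raw = raw_weights(THEMES)
descendant = {name: descendant_weight(name, raw, CHILDREN) for name in raw}
```

Under these assumptions, `descendant["A"]` is the 20+10+15+10+50 sum described in the text, and leaf nodes such as C and F receive a descendant weight of zero.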
As shown in block 340 of FIG. 4, an ancestor weight, which measures how much weight is from the ancestors of the subject node (i.e., the node under analysis), is calculated. In one embodiment, to calculate an ancestor weight, the parent weight (i.e., the raw weight plus the node's parent weight) is divided proportionally among the child nodes based on the relative weight of the child nodes. Table 4 shows, for each node, the node name and corresponding ancestor weight.
TABLE 4
Node Name Ancestor Weight
C' 6.7
C" 1.4
A 0
C 1.9
D 11.9
F 21.9
H 0
K 0
For the example of FIG. 3, node.sub.C has an ancestor weight of approximately 1.9, node.sub.C' has an ancestor weight of approximately 6.7, and node.sub.C" has an ancestor weight of approximately 1.4, from the raw weight of node.sub.A. Node.sub.F has a total ancestor weight of 24.44 from the raw weight of node.sub.C' plus the ancestor weight of node.sub.C'. Accordingly, the output of weight summing provides three weights: a raw weight that represents the weight from the themes of the theme set for clustering; a descendant weight that measures the amount of weight from the descendants of a particular node; and an ancestor weight that measures the amount of weight from the ancestors of a particular node. FIG. 5 is a flow diagram illustrating one embodiment for selecting focal categories in accordance with one embodiment for clustering analysis. As shown in block 410, the process begins by selecting the node highest in the tree and not yet evaluated. As shown in blocks 420 and 445, if the percent difference between the raw weight plus the ancestor weight and the descendant weight is greater than the depth cut-off percent parameter, then that node is selected as the focal point for the cluster. Table 5 shows raw plus ancestor weights and descendant weights for each node of the set of nodes in the example of FIG. 3.
TABLE 5
Node Name    Raw Weight + Ancestor Weight    Descendant Weight
A 10 105
C 21.9 0
C' 16.7 60
C" 16.4 0
D 26.7 50
F 76.7 0
H 30 0
K 10 0
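One possible reading of the ancestor-weight rule (block 340) is sketched below: each node passes its raw weight plus its own ancestor weight down to its children, split in proportion to each child's raw plus descendant weight. This is an interpretation, not a statement of the patented method. The tree in `CHILDREN` is the same simplified, assumed shape of ontology 220 (FIG. 3 is not reproduced), and all function names are hypothetical. Under this reading the sketch reproduces the ancestor weights quoted for C, C' and C" and the raw-plus-ancestor column of Table 5 for ontology 220; the D and F entries of Table 4 differ, so the proportional rule shown here should be treated as only one interpretation.

```python
# Assumed, simplified shape of ontology 220; raw weights from Table 2.
CHILDREN = {"A": ["C", "C'", 'C"'], "C'": ["D"], "D": ["F"]}
RAW = {"A": 10, "C": 20, "C'": 10, 'C"': 15, "D": 10, "F": 50}


def descendant_weight(node):
    """Sum of the raw weights of every node below `node`."""
    return sum(RAW.get(c, 0) + descendant_weight(c) for c in CHILDREN.get(node, []))


def ancestor_weights(root):
    """Push (raw + ancestor) weight down the tree, split proportionally."""
    ancestor = {}

    def visit(node, inherited):
        ancestor[node] = inherited
        to_share = RAW.get(node, 0) + inherited
        kids = CHILDREN.get(node, [])
        # Relative size of each child: its raw weight plus descendant weight.
        size = {k: RAW.get(k, 0) + descendant_weight(k) for k in kids}
        total = sum(size.values())
        for k in kids:
            visit(k, to_share * size[k] / total if total else 0.0)

    visit(root, 0.0)
    return ancestor


ancestor = ancestor_weights("A")
raw_plus_ancestor = {n: RAW[n] + ancestor[n] for n in RAW}
```

For example, node A's raw weight of 10 is shared among C, C' and C" in the ratio 20 : 70 : 15, which gives approximately 1.9, 6.7 and 1.4.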
Starting with node.sub.A in tree structure 220, the percentage of the raw plus ancestor weight to the descendant weight is less than the percentage cut-off weight of fifty-five percent (55%). Thus, for ontology 220, node.sub.A is not selected as the focal point for this cluster. In tree structure 230, the process compares the percentage of the raw plus ancestor weight of node.sub.H' to the descendant weight of node.sub.H'. In this case, if the depth cut-off percent parameter is less than 66%, then node.sub.H' is selected as the focal point of the cluster on ontology 230. As shown in blocks 420 and 430 of FIG. 5, if the percentage of the raw plus ancestor weight of the selected node to the descendant weight is less than the depth cut-off percent parameter, then child nodes, in that ontology, are evaluated. As shown in blocks 450 and 460, the child nodes are selected and ordered, in a descending order, by total weight (i.e., raw weight+ancestor weight+descendant weight). For ontology 220 of FIG. 3, the descending order for the remaining child nodes is: Node.sub.C' (81.9), Node.sub.C (21.9), and Node.sub.C" (16.4). As shown in block 470 of FIG. 5, the percent of the smaller of the pair to the larger (e.g., an actual percentage difference) is calculated for each adjoining pair of child nodes. For the ontology 220 of FIG. 3, node.sub.C is twenty-six percent (26%) of the value of node.sub.C' and node.sub.C" is seventy-four percent (74%) of the value of node.sub.C. As shown in block 480 of FIG. 5, an average expected percent difference or drop for all of the child nodes is computed as 100 - (100/number of items left on the list) (e.g., thirty-three and a third percent (33.3%) for the example of FIG. 3). In general, cluster analysis 190 (FIG. 1) utilizes the calculated percent differences and the breadth cut-off percentage to select child nodes as candidates for the focal point node, as shown in block 490 of FIG. 5.
This process may result in selection of more than a single focal node from the list of child nodes. The selected child nodes are utilized in the process starting again at block 410 as shown in FIG. 5.
Knowledge Base
In general, the knowledge base 155 is the repository for all knowledge about languages and about the concrete and abstract worlds described by language in human discourse. The knowledge base 155 contains two major types of data: language specific data necessary to describe a language used for human discourse, and language independent data necessary to describe the meaning of human discourse. In general, in normalization processing 120 (FIG. 7), given a term, the goal is to analyze and manipulate its language dependent features until a language independent ontological representation is found. The knowledge base 155 consists of concepts, general categories, and cross-references. Concepts, or detailed categories, are a subset of the canonical forms determined by the language dependent data. These concepts themselves are language independent. In different languages their text representations may be different; however, these terms represent the universal ontological location. Concepts are typically thought of as identification numbers that have potentially different representations in different languages. These representations are the particular canonical forms in those languages. General categories are themselves concepts, and have canonical form representations in each language. These categories have the additional property that other concepts and general categories can be associated with them to create a knowledge hierarchy. Cross references are links between general categories. These links augment the ancestry links that are generated by the associations that form a directed graph. The ontology in the knowledge base 155 contains only canonical nouns and noun phrases, and it is the normalization processing 120 (FIG.
7) that provides mappings from non-nouns and non-canonical nouns. The organization of the knowledge base 155 provides a world view of knowledge, and therefore the ontology actually contains only ideas of canonical nouns and noun phrases. The text representation of those ideas is different in each language, but the ontological location of the ideas in the knowledge base 155 remains the same for all languages. The organizational part of the knowledge base 155 is the structured category hierarchy composed, at the top level, of general categories. These categories represent knowledge about how the world is organized. The hierarchy of general categories is a standard tree structure. In one embodiment, a depth limit of sixteen levels is maintained. The tree organization provides a comprehensive structure that permits augmentation of more detailed information. The tree structure results in a broad but shallow structure. The average depth from tree top to a leaf node is five, and the average number of children for non-leaf nodes is 4.5. There are two types of general categories: concrete and abstract. This distinction is an organizational one only and it has no functional ramifications. A concrete category is one that represents a real-world industry, field of study, place, technology or physical entity. The following are examples of concrete categories: "chemistry", "computer industry", "social identities", "Alabama", and "Cinema." An abstract category is one that represents a relationship, quality, feeling or measure that does not have an obvious physical real-world manifestation. The following are examples of abstract categories: "downward motion", "stability", "stupidity, foolishness, fools", "mediation, pacification", "texture", and "shortness." Many language dependent canonical forms are mapped to the language independent concepts stored in the knowledge base 155.
The concept is any idea found in the real world that can be classified or categorized as being closely associated with one and only one knowledge base 155 general category. Similarly, any canonical form in a particular language can map to one and only one concept. For example, there is a universal concept for the birds called "cranes" in English, and a universal concept for the machines called "cranes" in English. However, the canonical form "cranes" does not map to either concept in English due to its ambiguity. In another language, which may have two different canonical forms for these concepts, mapping may not be a problem. Similarly, if "cranes" is an unambiguous canonical form in another language, then no ambiguity is presented in mapping. Cross references are mappings between general categories that are not directly ancestrally related, but that are close to each other ontologically. Direct ancestral relationship means parent-child, grandparent-grandchild, great grandparent-great grandchild, etc. Cross references reflect a real-world relationship or common association between the two general categories involved. These relationships can usually be expressed by universal or majority quantification over one category. Examples of valid cross references and the relationships are shown in Table 6.
TABLE 6
oceans --> fish (all oceans have fish)
belief systems --> moral states (all belief systems address moral states)
electronics --> physics (all electronics deals with physics)
death and burial --> medical problems (most cases of death and burial are caused by medical problems)
Cross references are not automatically bidirectional. For example, in the first entry of Table 6, although oceans are associated with fish, because all oceans have fish, the converse may not be true since not all fish live in oceans. The names for the general categories are chosen such that the cross references that involve those general categories are valid with the name or label choices. For example, if there is a word for fresh water fish in one language that is different than the word for saltwater fish, the oceans to fish cross reference is not valid if the wrong translation of fish is used. Although the knowledge base 155 is described as cross linking general categories, concepts may also be linked without deviating from the spirit and scope of the invention. FIG. 6 illustrates an example portion of a knowledge base including cross references and links among categories and terms. The classification hierarchy and notations shown in FIG. 6 illustrate an example that classifies a document on travel or tourism, and more specifically on traveling to France and visiting museums and places of interest. As shown in FIG. 6, the classification categories (e.g., knowledge catalog 560) contain two independent static ontologies, one ontology for "geography", and a second ontology for "leisure and recreation." The "geography" ontology includes categories for "political geography", "Europe", "Western Europe", and "France." The categories "arts and entertainment" and "tourism" are arranged under the high level category "leisure and recreation." The "visual arts" and the "art galleries and museums" are subcategories under the "arts and entertainment" category, and the category "places of interest" is a subcategory under the category "tourism." The knowledge base 155 is augmented to include linking and cross referencing among categories for which a linguistic, semantic, or usage association has been identified. For the example illustrated in FIG.
6, the categories "France", "art galleries and museums", and "places of interest" are cross referenced and/or linked as indicated by the circles, which encompass the category names, as well as the lines and arrows. This linking and/or cross referencing indicates that the categories "art galleries and museums" and "places of interest" may appear in the context of "France." For this example, the knowledge base 155 indicates that the Louvre, a proper noun, is classified under the category "art galleries and museums", and further associates the term "Louvre" to the category "France." Similarly, the knowledge base 155 indicates that the term "Eiffel Tower" is classified under the category "places of interest", and is also associated with the category "France." The knowledge base 155 may be characterized, in part, as a directed graph. The directed graph provides information about the linguistic, semantic, or usage relationships among categories, concepts and terminology. The "links" or "cross references" on the directed graph, which indicate the associations, are graphically depicted in FIG. 6 using lines and arrows. For the example shown in FIG. 6, the directed graph indicates that there is a linguistic, semantic, or usage association among the concepts "France", "art galleries and museums", and "places of interest."
Semantic Proximity
To rank the query feedback terms, semantic proximity among the focal categories or nodes and the query feedback terms is calculated (block 252, FIG. 2). For purposes of nomenclature, the focal nodes (i.e., those nodes in the knowledge base 155 selected during cluster analysis) are referred to as cluster nodes, and represented as C[i], wherein i is an integer value that ranges between 1 and N and N represents the total number of cluster nodes.
The weights or strengths associated with each cluster node are referred to as W[i], wherein i is the same variable referred to in defining the cluster node (i.e., i ranges between 1 and N, where N is the total number of cluster nodes). Also, for purposes of nomenclature, the node of a category in a knowledge base associated with a query feedback term is referred to as "F." Using the above nomenclature, measuring semantic proximity may be referred to as establishing a quantitative measure of semantic proximity between F and the set C. The knowledge base 155 includes three different types of links between any two nodes: (1) parent, (2) child, and (3) cross-reference. In one embodiment, to measure semantic proximity, the following weights are associated with each type of link as shown in Table 7.
TABLE 7
Link Type Weight
Parent p
Child q
XRef r
The weights p, q, and r are numeric values greater than or equal to 1. To calculate a semantic proximity, the query feedback processing 185 (FIG. 1) first identifies the shortest path between query feedback term nodes and cluster nodes. Specifically, to identify the shortest path, for each C[i], the path with the least number of links, regardless of the link type, is located between F (the query feedback term node) and each of the focal nodes C[i]. However, this calculation may lead to multiple paths between a query feedback term node and a cluster node. The paths identified are defined as P[i][j], wherein j ranges between 1 and M[i], and M[i] is the total number of paths identified between the feedback node, F, and the cluster node C[i]. Thus, P[x][y] refers to the y.sup.th path identified above between the feedback node, F, and the focal node, C[x]. Each path in the set of P is composed of parent, and/or child, and/or cross-reference links. For this embodiment, to compute the semantic proximity, another weight, referred to as the self-link weight, is defined. A value, "s," is assigned for this self-link weight. The following pseudo-code defines steps for computing semantic proximity (SA) in accordance with one embodiment for query feedback processing.
SA = 0                          /* initialize semantic proximity to zero */
for each x in 1..N              /* for each cluster node */
{
    psd = 0                     /* initialize path specific semantic proximity contribution */
    for each y in 1..M[x]       /* for each path identified from F to cluster node C[x] */
    {
        tmp = W[x]/(s + (sum of weights associated with each link in path P[x][y]))
        if tmp > psd then       /* find maximum */
            psd = tmp
    }
    increment SA by psd
}
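Read as a single expression, the pseudo-code above amounts to a sum, over cluster nodes, of the best per-path contribution. The notation w(l) for the weight of a link l (one of p, q or r from Table 7) is introduced here for illustration only and does not appear in the original text; when no path exists for a given cluster node (M[x] = 0), the inner maximum is taken to be zero, matching the initialization of psd.

```latex
\[
  SA \;=\; \sum_{x=1}^{N} \; \max_{1 \le y \le M[x]}
  \frac{W[x]}{\, s + \sum_{l \in P[x][y]} w(l) \,},
  \qquad w(l) \in \{p,\, q,\, r\}.
\]
```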
Once semantic proximity is calculated, it is used to rank feedback terms. As discussed above, semantic proximity is directly proportional to the rank of the feedback term. Specifically, the higher the semantic proximity, the higher the feedback term is ranked. In one embodiment, the values for the link weights are derived using empirical testing. The following constraints are used to prevent deterioration of the quality of ranking: (1) s<p; (2) s<q; and (3) s<r. These constraints ensure that if F is equal to C[x], for some x between 1 and N, then its contribution to the total semantic proximity between F and the set C is higher than the contribution to the total semantic proximity from the pair F and C[y], for some y between 1 and N, but y not equal to x. This also assumes that the weight W[x] is equal to W[y]. For example, one set of link weights that complies with the above constraints is listed in Table 8.
TABLE 8
Weight Values
p 2
q 2
r 2
s 1
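The ranking procedure can be illustrated with a small, hedged sketch that uses the link weights of Table 8. The category names come from the FIG. 6 discussion, but the exact parent-child chain, the direction of the cross references, the choice of "France" as the only focal node, and its weight of 10 are all assumptions made for illustration. The function names are hypothetical, and cross references are followed only in their stated direction, which is one possible reading of the text.

```python
LINK_WEIGHT = {"parent": 2, "child": 2, "xref": 2}  # p, q, r from Table 8
SELF_LINK = 1                                        # s from Table 8

# Assumed (child, parent) pairs and directed cross references.
HIERARCHY = [
    ("political geography", "geography"), ("Europe", "political geography"),
    ("Western Europe", "Europe"), ("France", "Western Europe"),
    ("arts and entertainment", "leisure and recreation"),
    ("tourism", "leisure and recreation"),
    ("art galleries and museums", "arts and entertainment"),
    ("places of interest", "tourism"),
]
XREFS = [("art galleries and museums", "France"), ("places of interest", "France")]


def build_links(hierarchy, xrefs):
    """Map each node to its outgoing (neighbor, link type) pairs."""
    links = {}
    for child, parent in hierarchy:
        links.setdefault(child, []).append((parent, "parent"))
        links.setdefault(parent, []).append((child, "child"))
    for src, dst in xrefs:
        links.setdefault(src, []).append((dst, "xref"))
    return links


def shortest_path_costs(links, start, goal):
    """Summed link weights of every fewest-link path from start to goal."""
    if start == goal:
        return [0]  # zero links: only the self-link weight applies
    costs, frontier, seen = [], [(start, 0)], {start}
    while frontier and not costs:
        next_frontier, reached = [], set()
        for node, cost in frontier:
            for nbr, kind in links.get(node, []):
                if nbr in seen:
                    continue
                if nbr == goal:
                    costs.append(cost + LINK_WEIGHT[kind])
                else:
                    next_frontier.append((nbr, cost + LINK_WEIGHT[kind]))
                    reached.add(nbr)
        seen |= reached
        frontier = next_frontier
    return costs


def semantic_proximity(links, term, focal_weights):
    """Sum, over focal nodes, of the best W / (s + path cost)."""
    total = 0.0
    for focal, weight in focal_weights.items():
        costs = shortest_path_costs(links, term, focal)
        total += max((weight / (SELF_LINK + c) for c in costs), default=0.0)
    return total


links = build_links(HIERARCHY, XREFS)
focal = {"France": 10.0}  # hypothetical focal node and weight
terms = ["geography", "tourism", "France", "art galleries and museums"]
ranked = sorted(terms, key=lambda t: -semantic_proximity(links, t, focal))
```

With these assumed inputs, a feedback term that coincides with the focal node scores W/s, a term one link away scores W/(s+2), and terms farther away score progressively less, so `ranked` lists "France" first and "geography" last.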
Content Processing System
FIG. 7 is a block diagram illustrating one embodiment for a content processing system. In general, the content processing system 110 analyzes words and phrases to identify the thematic content. For example, the content processing system 110 analyzes the document set 130 and generates the document theme vector 160. For this embodiment, the content processing system 110 includes a linguistic engine 700, a normalization processing 120, a theme vector processor 750, and a morphology section 770. The linguistic engine 700 receives, as input, the document set 130, and generates, as output, the structured output 710. The linguistic engine 700, which includes a grammar parser and a theme parser, processes the document set 130 by analyzing the grammatical or contextual aspects of each document, as well as analyzing the stylistic and thematic attributes of each document. Specifically, the linguistic engine 700 generates, as part of the structured output 710, contextual tags 720, thematic tags 730, and stylistic tags 735 that characterize each document. Furthermore, the linguistic engine extracts topics and content carrying words 737, through use of the thematic tags 730, for each sentence in the documents. For a detailed description of the contextual and thematic tags, see U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995, that includes an Appendix D, entitled "Analysis Documentation", which is expressly incorporated herein by reference. In one embodiment, the linguistic engine 700 generates the contextual tags 720 via a chaos loop processor. All words in a text have varying degrees of importance in the text, some carrying grammatical information, and others carrying the meaning and content of the text. In general, the chaos loop processor identifies, for words and phrases in the documents, grammatical aspects of the documents including identifying the various parts of speech.
In order to accomplish this, the chaos loop processor ascertains how the words, clauses and phrases in a sentence relate to each other. By identifying the various parts of speech for words, clauses, and phrases for each sentence in the documents, the context of the documents is defined. The chaos loop processor stores information in the form of the contextual tags 720. U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995, includes an Appendix C, entitled "Chaos Processor for Text", that contains an explanation for generating contextual or grammatical tags.
A theme parser within the linguistic engine 700 generates the thematic tags 730. Each word carries thematic information that conveys the importance of the meaning and content of the documents. In general, the thematic tags 730 identify thematic content of the document set 130. Each word is discriminated in the text, identifying importance or meaning, the impact on different parts of the text, and the overall contribution to the content of the text. The thematic context of the text is determined in accordance with predetermined theme assessment criteria that are a function of the strategic importance of the discriminated words. The predetermined thematic assessment criteria define which of the discriminated words are to be selected for each thematic analysis unit. The text is then output in a predetermined thematic format. For a further explanation of a theme parser, see Appendix E, entitled "Theme Parser for Text", of U.S. Pat. No. 5,694,523, inventor Kelly Wical, issued Dec. 2, 1997, entitled "Content Processing System for Discourse", filed May 31, 1995.
As shown in FIG. 7, the morphology section 770 contains the knowledge catalog 560 and a lexicon 760. In one embodiment, the knowledge catalog 560 identifies categories for the document themes.
For this embodiment, the knowledge catalog 560 contains categories, arranged in a hierarchy, that reflect a world view of knowledge. Appendix A of U.S. Pat. No. 5,694,523, inventor Kelly Wical, issued Dec. 2, 1997, entitled "Content Processing System for Discourse", filed May 31, 1995, is an example of a knowledge catalog for use in classifying documents.
In general, the lexicon 760 stores definitional characteristics for a plurality of words and terms. For example, the lexicon 760 defines whether a particular word is a noun, a verb, an adjective, etc. The linguistic engine 700 uses the definitional characteristics stored in the lexicon 760 to generate the contextual tags 720, thematic tags 730, and the stylistic tags 735. An example lexicon, for use with a content processing system, is described in Appendix B, entitled "Lexicon Documentation", of U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995.
The topics and content carrying words 737 are input to the normalization processing 120. In part, the normalization processing 120 processes the content carrying words for direct use with the knowledge catalog 560 and knowledge base 155. Specifically, the normalization processing 120 generates, as appropriate, the canonical, nominal or noun form of each content carrying word, as well as the count sense and mass sense of the word. Furthermore, the normalization processing 120 determines, from the knowledge catalog 560, which content carrying words are non ambiguous.
As shown in FIG. 7, the theme vector processor 750 receives the thematic tags 730 and contextual tags 720 from the structured output 710. In addition, the non ambiguous content carrying words from the normalization processing 120 are input to the theme vector processor 750. The content carrying words may include single words or phrases.
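By way of illustration only, the normalization step described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the implementation of U.S. Pat. No. 5,694,523: the tables NOMINAL_FORMS and CATALOG and the functions to_nominal and non_ambiguous are hypothetical names with toy data, and a word is assumed to be non ambiguous when the catalog places its nominal form in exactly one category.
```python
from typing import Dict, List, Set

# Hypothetical toy data. The actual knowledge catalog 560 and the rules for
# deriving nominal forms are not reproduced here.
NOMINAL_FORMS: Dict[str, str] = {
    "investing": "investment",
    "banks": "bank",
    "legislate": "legislation",
}

CATALOG: Dict[str, Set[str]] = {
    "investment": {"finance"},
    "legislation": {"law"},
    "bank": {"finance", "geography"},  # institution vs. river bank
}


def to_nominal(word: str) -> str:
    """Return the noun (nominal) form of a content carrying word, if known."""
    key = word.lower()
    return NOMINAL_FORMS.get(key, key)


def non_ambiguous(words: List[str]) -> Dict[str, str]:
    """Map each normalized word to its single catalog category.

    Assumption: a word is non ambiguous when the catalog lists exactly one
    category for its nominal form. Unknown and multi-category words are dropped.
    """
    result: Dict[str, str] = {}
    for word in words:
        nominal = to_nominal(word)
        categories = CATALOG.get(nominal, set())
        if len(categories) == 1:
            result[nominal] = next(iter(categories))
    return result
```
In this sketch, the words "Investing" and "legislate" would be passed on in nominal form with one category each, while "banks" would be withheld because its nominal form falls under two categories.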
The content carrying words output from the normalization processing 120 have been converted to the noun or nominal form.
In general, the theme vector processor 750 presents a thematic profile of the content of each document (e.g., generates the document theme vector 160), including classifying the documents in the knowledge catalog 560. To accomplish this, the theme vector processor 750 determines the relative importance of the non ambiguous content carrying words in the document set. In one embodiment, the theme vector processor 750 generates a list of theme terms, including words and phrases, and assigns a relative theme strength to each theme term. The theme vector processor 750, through use of the knowledge catalog 560, generates a theme concept for each theme term by mapping the theme terms to categories in the knowledge catalog 560. Thus, the theme concepts indicate a general topic or category in the knowledge catalog 560 to identify the content of each document. In addition, the theme vector processor 750 generates, for each theme term, an importance number, a theme strength, and an overall capacity weight of collective content importance.
In one embodiment, the theme vector processor 750 executes a plurality of heuristic routines to generate the theme strengths for each theme. U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", contains source code to generate the theme strengths in accordance with one embodiment for theme vector processing. Also, a further explanation of generating a thematic profile is contained in U.S. Pat. No. 5,694,523.
Computer System
FIG. 8 illustrates a high level block diagram of a general purpose computer system in which the information retrieval system of the present invention may be implemented. A computer system 1000 contains a processor unit 1005, main memory 1010, and an interconnect bus 1025.
The processor unit 1005 may contain a single microprocessor, or may contain a plurality of microprocessors for configuring the computer system 1000 as a multi-processor system. The main memory 1010 stores, in part, instructions and data for execution by the processor unit 1005. If the information retrieval system of the present invention is wholly or partially implemented in software, the main memory 1010 stores the executable code when in operation. The main memory 1010 may include banks of dynamic random access memory (DRAM) as well as high speed cache memory.
The computer system 1000 further includes a mass storage device 1020, peripheral device(s) 1030, portable storage medium drive(s) 1040, input control device(s) 1070, a graphics subsystem 1050, and an output display 1060. For purposes of simplicity, all components in the computer system 1000 are shown in FIG. 8 as being connected via the bus 1025. However, the computer system 1000 may be connected through one or more data transport means. For example, the processor unit 1005 and the main memory 1010 may be connected via a local microprocessor bus, and the mass storage device 1020, peripheral device(s) 1030, portable storage medium drive(s) 1040, and graphics subsystem 1050 may be connected via one or more input/output (I/O) busses.
The mass storage device 1020, which may be implemented with a magnetic disk drive or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by the processor unit 1005. In the software embodiment, the mass storage device 1020 stores the information retrieval system software for loading to the main memory 1010.
The portable storage medium drive 1040 operates in conjunction with a portable non-volatile storage medium, such as a floppy disk or a compact disc read only memory (CD-ROM), to input and output data and code to and from the computer system 1000.
In one embodiment, the information retrieval system software is stored on such a portable medium, and is input to the computer system 1000 via the portable storage medium drive 1040.
The peripheral device(s) 1030 may include any type of computer support device, such as an input/output (I/O) interface, to add additional functionality to the computer system 1000. For example, the peripheral device(s) 1030 may include a network interface card for interfacing the computer system 1000 to a network. For the software implementation, the documents may be input to the computer system 1000 via a portable storage medium or a network for processing by the information retrieval system.
The input control device(s) 1070 provide a portion of the user interface for a user of the computer system 1000. The input control device(s) 1070 may include an alphanumeric keypad for inputting alphanumeric and other key information, and a cursor control device, such as a mouse, a trackball, a stylus, or cursor direction keys.
In order to display textual and graphical information, the computer system 1000 contains the graphics subsystem 1050 and the output display 1060. The output display 1060 may include a cathode ray tube (CRT) display or liquid crystal display (LCD). The graphics subsystem 1050 receives textual and graphical information, and processes the information for output to the output display 1060. The components contained in the computer system 1000 are those typically found in general purpose computer systems, and in fact, these components are intended to represent a broad category of such computer components that are well known in the art.
The information retrieval system may be implemented in either hardware or software. For the software implementation, the information retrieval system is software that includes a plurality of computer executable instructions for implementation on a general purpose computer system.
Prior to loading into a general purpose computer system, the information retrieval system software may reside as encoded information on a computer readable medium, such as a magnetic floppy disk, magnetic tape, or compact disc read only memory (CD-ROM).
In one hardware implementation, the information retrieval system may comprise a dedicated processor including processor instructions for performing the functions described herein. Circuits may also be developed to perform the functions described herein. The knowledge catalog 560 and knowledge base 155 may be implemented as a database stored in memory for use by the information retrieval system.
Although the present invention has been described in terms of specific exemplary embodiments, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention.
