Hierarchical query feedback in an information retrieval system
6094652
Abstract
An information retrieval system generates hierarchical query feedback to a user to facilitate the user in reformulating the query. The information retrieval system, which supports both text and theme queries, includes a knowledge base comprising a plurality of nodes of terminology, arranged hierarchically, that reflect associations among the terminology. For the hierarchical query feedback terms, the information retrieval system selects terminology that broadens and narrows the query terms by selecting parent nodes and child nodes, respectively, of the nodes for terminology that corresponds to the terms of the query. The information retrieval system also selects terminology that is generally related to the query terms by selecting nodes of the knowledge base that are cross linked to the nodes for terminology that corresponds to the terms of the query. Normalization processing, which generates canonical forms for query processing, and a content processing system, which generates themes for theme queries, are also disclosed.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
beauty
    beautifulness
    beautiousness
    beauties
scamps
    scampishness
    scampishnesses
    scamp
stupidity
    dull-headedness
    dull-headednesses
    lame-brainedness
    lame-brainednesses
    stupidities
cooperation
    cooperating
______________________________________
Exceptions to this rule are nouns that have become very common in their "ness", "ing" or "bility" forms, or are not readily separable from their suffixes such as "sickness", "fishing" and "notability." Similar to non-noun based forms, canonical nouns do not have mood-changing prefixes. Table 2 lists three canonical noun forms, with their alternate forms, including those that carry mood-changing prefixes, indented.
TABLE 2
______________________________________
spots
    unspottedness
    spottedness
professionalism
    unprofessionalism
taste
    distastefulness
    tastefulness
______________________________________
Exceptions to this rule are, as with non-noun based forms, those nouns which, when the prefix is removed, do not retain their meaning or even their part of speech. Examples of these exceptions are "distension", "exploration", or "unction". As shown in block 420 of FIG. 4, if the canonical form exists, then the canonical form is used instead of the query term as the token for query processing. As shown in block 430 of FIG. 4, normalization processing 120 ascertains whether the query term is a noun. In one embodiment, the lexicon 760 (FIG. 7) indicates whether the query term is a noun. In English, proper nouns are defined as nouns that represent specific people, places, days and months, organizations, businesses, products, religious items, or works of art. The initial letters of proper nouns are almost always capitalized. Exceptions to capitalization are rare, and are usually for artistic or attention-getting reasons. A proper noun phrase is a noun phrase that begins and ends with a proper noun. Table 3 lists valid proper nouns or noun phrases.
TABLE 3
______________________________________
Chicago
IBM
Carlton Fisk
October
International Business Machines Corporation
International Society of Engineers
e.e. cummings
Judgement Day
______________________________________
Table 4 lists noun phrases that are not valid proper noun phrases.
TABLE 4
______________________________________
California condor
heart of Texas
AWOL (this is an acronym of a common noun phrase)
______________________________________
In very rare cases, proper nouns or noun phrases pluralize. If they do, the plural form is canonical. For example, "Texans" is the canonical form of "Texan." Also, "Geo Prisms" is the canonical form of "Geo Prism." When a proper noun phrase is represented by an acronym, the canonical form is a phrase consisting of the acronym, without periods, followed by a hyphen, followed by the full unabbreviated noun phrase. Each possible form of the acronym and of the phrase it stands for becomes an alternate form of the new canonical form. Table 5 lists each canonical form first, with non-exhaustive examples of alternate forms indented.
TABLE 5
______________________________________
IBM - International Business Machines Corporation
    IBM
    I.B.M.
    International Business Machines Corporation
    International Business Machines Corp.
    IBM Corp.
MISL - Major Indoor Soccer League
    MISL
    M.I.S.L.
    Major Indoor Soccer League
______________________________________
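The acronym rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule (acronym without periods, a hyphen, then the full phrase); the function names `acronym_canonical` and `build_alternate_map` are hypothetical and do not come from the disclosure, and the dotted spelling is generated under the assumption that each letter of the acronym takes a period.

```python
def acronym_canonical(acronym: str, full_phrase: str) -> str:
    """Return '<ACRONYM> - <full phrase>' with periods stripped from the acronym."""
    bare = acronym.replace(".", "")
    return f"{bare} - {full_phrase}"


def build_alternate_map(acronym: str, full_phrase: str, extra_alternates=()):
    """Map each alternate form to its canonical form (hypothetical helper).

    The bare acronym, its dotted spelling, and the full phrase are always
    included; other variants (e.g. 'IBM Corp.') must be supplied by hand.
    """
    canonical = acronym_canonical(acronym, full_phrase)
    bare = acronym.replace(".", "")
    dotted = ".".join(bare) + "."  # assumption: one period after each letter
    alternates = {bare, dotted, full_phrase, *extra_alternates}
    return {alt: canonical for alt in alternates}


ibm = build_alternate_map(
    "I.B.M.",
    "International Business Machines Corporation",
    extra_alternates=("International Business Machines Corp.", "IBM Corp."),
)
print(ibm["I.B.M."])  # IBM - International Business Machines Corporation
```

A lookup of any listed alternate form then yields the single canonical token used for query processing.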
Commercial names also appear as query terms. There are many alternate forms for most commercial proper noun phrases. Those phrases, although they do not have acronyms associated with them, still require a consistent canonical form representation. For English proper noun phrases, Table 6 lists a set of rules for commercial names.
TABLE 6
______________________________________
All abbreviations will be spelled out
    Inc. --> Incorporated
    Int'l. --> International
    Org. --> Organization
Hyphens will be preferred where there is a choice
    Long Term --> Long-Term
    Alka Seltzer --> Alka-Seltzer
Ampersands will be used in place of the word `and`
    Cahill, Gordon and Reindel --> Cahill, Gordon & Reindel
    Growth and Income --> Growth & Income
______________________________________
The rules set forth in Table 6, when combined in proper noun phrases with multiple features, create many alternate forms from a single canonical form. Since there is no way to predict how a company or product is going to be referred to in a query, this proliferation of alternate forms is necessary to achieve consistent representations whenever possible. Table 7 lists the canonical form of each of two commercial names, followed by an indented list of its alternate forms.
TABLE 7
______________________________________
Cahill, Gordon & Reindel
    Cahill, Gordon and Reindel
    Cahill, Gordon, & Reindel
    Cahill, Gordon, and Reindel
Commodore International, Incorporated
    Commodore, Inc.
    Commodore Inc.
    Commodore, Inc
    Commodore, Incorporated
    Commodore Incorporated
    Commodore International
    Commodore International, Inc.
    Commodore International Inc.
    Commodore International, Inc
    Commodore International Inc
    Commodore International Incorporated
    Commodore Int'l., Inc.
    Commodore Int'l., Inc
    Commodore Int'l. Inc.
    Commodore Int'l. Inc
    Commodore Int'l. Incorporated
    Commodore Int'l., Incorporated
    Commodore Int'l, Inc.
    Commodore Int'l, Inc
    Commodore Int'l Inc.
    Commodore Int'l Inc
    Commodore Int'l Incorporated
    Commodore Int'l, Incorporated
______________________________________
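As a rough illustration of how the Table 6 rules might be applied mechanically, the following Python sketch rewrites a commercial name toward its canonical form. The dictionaries hold only the examples from Table 6, the function name `normalize_commercial_name` is hypothetical, and the step that places a comma before a trailing "Incorporated" is an assumption inferred from the canonical form shown in Table 7. Rules alone do not cover every alternate form: "Commodore, Inc." in Table 7 omits "International" entirely, so an explicit alternate-form lookup would still be needed for such cases.

```python
import re

# Expansions taken from Table 6; a real lexicon would hold many more.
ABBREVIATIONS = {
    "Inc": "Incorporated",
    "Int'l": "International",
    "Org": "Organization",
}
# Known hyphenated spellings (the Table 6 examples only).
HYPHENATED = {"Long Term": "Long-Term", "Alka Seltzer": "Alka-Seltzer"}


def normalize_commercial_name(name: str) -> str:
    # Rule 1: spell out abbreviations, with or without a trailing period.
    for abbr, full in ABBREVIATIONS.items():
        name = re.sub(rf"\b{re.escape(abbr)}\b\.?", full, name)
    # Rule 2: prefer the hyphenated spelling where one is known.
    for plain, hyphenated in HYPHENATED.items():
        name = name.replace(plain, hyphenated)
    # Rule 3: use an ampersand in place of the word "and",
    # then drop any comma left directly before the ampersand.
    name = re.sub(r"\s+and\s+", " & ", name)
    name = re.sub(r",\s*&", " &", name)
    # Assumption (inferred from Table 7): comma before a trailing "Incorporated".
    name = re.sub(r",?\s+Incorporated$", ", Incorporated", name)
    return name


print(normalize_commercial_name("Commodore Int'l Inc"))
print(normalize_commercial_name("Cahill, Gordon, and Reindel"))
```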
The canonical forms of common noun phrases are created using the same rules as single common nouns and proper noun phrases. The mass singular form is preferred, the count plural form is next as illustrated in FIG. 4. Any abbreviations, acronyms, hyphens or ampersands are handled as they are in proper noun phrases. Table 8 lists canonical forms and common noun phrases, indented, that pertain to the canonical form.
TABLE 8
______________________________________
atomic bombs
    A-bomb
    A bomb
    A-bombs
    A bombs
    atom bomb
    atom bombs
    atomic bomb
satirical poetry
    satirical poetries
______________________________________
Some noun phrases refer to the same entity, and are referred to as "multiple referents." Where different nouns or noun phrases refer to exactly the same entity, one noun is usually selected as the canonical form, and the other nouns are considered alternate forms. Table 9 lists nouns and noun phrases that refer to the same entity, wherein the canonical form is left justified and the alternate forms are indented.
TABLE 9
______________________________________
Mark Twain
    Samuel Clemens
    Samuel L Clemens
    Samuel L. Clemens
    Samuel Longhorn Clemens
angelfish
    angelfishes
    scalare
    scalares
______________________________________
As shown in block 440 of FIG. 4, if the query term is not a noun, then a determination is made as to whether the query term has a nominal form. If the query term has a nominal form, then the nominal form is used as a token, instead of the query term as shown in block 450. If the term does not have a nominal form, then the query term is used as the token as shown in block 480. If the term is a noun, as ascertained in block 430, then a further inquiry determines whether the query term is a mass noun as shown in block 460. The preferred canonical form of a noun or noun phrase in English is its mass singular form. Nouns, which are mass only nouns, such as "chess" or "goats milk" have only one form, and this is the canonical form. However, most nouns that are mass nouns are also count nouns. The canonical form of count nouns is typically the mass singular form. Examples of these types of nouns are "description", "fish", and "cheese." The count plural forms of these nouns ("descriptions", "fishes", and "cheeses") are referred to as alternate forms, and are transformed to the mass singular form for use as tokens. As shown in block 460 of FIG. 4, if the query term is not a mass noun, then the normalization processing determines whether the query term has a plural form as shown in block 470. If a noun or a noun phrase does not have a mass sense, then its canonical form is the count plural form. Nouns such as "chemical", "personal computer", and "California Condor" are alternate forms of the canonicals "chemicals", "personal computers", and "California Condors", respectively. If the plural form does exist, then the plural form is used as the token for query processing as shown in block 475. If the plural form does not exist, then the query term is used as the token as shown in block 480. Whether mass or count, there are several noun candidates for canonical form which are very close in meaning, but which have various levels of desirability based on morphology. 
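The decision flow of FIG. 4 described above (blocks 420 through 480) can be summarized as a short Python sketch. The `LexEntry` record, the toy `LEXICON`, and the function `to_token` are hypothetical stand-ins for the lexicon 760 and normalization processing 120; treating "cooperating" as a non-noun with the nominal form "cooperation" is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexEntry:
    canonical: Optional[str] = None      # stored canonical form, if any
    is_noun: bool = False
    nominal: Optional[str] = None        # nominal form of a non-noun
    mass_singular: Optional[str] = None  # set when the noun has a mass sense
    plural: Optional[str] = None         # count plural form


# Toy lexicon; the entries are illustrative, not taken from lexicon 760.
LEXICON = {
    "beautifulness": LexEntry(canonical="beauty", is_noun=True),
    "cooperating": LexEntry(nominal="cooperation"),
    "cheeses": LexEntry(is_noun=True, mass_singular="cheese"),
    "chess": LexEntry(is_noun=True, mass_singular="chess"),
    "chemical": LexEntry(is_noun=True, plural="chemicals"),
}


def to_token(term: str) -> str:
    entry = LEXICON.get(term, LexEntry())
    if entry.canonical:                  # block 420: canonical form exists
        return entry.canonical
    if not entry.is_noun:                # block 430: not a noun
        return entry.nominal or term     # blocks 440 -> 450, else 480
    if entry.mass_singular:              # block 460: mass noun
        return entry.mass_singular
    return entry.plural or term          # blocks 470 -> 475, else 480


for term in ("cheeses", "chemical", "cooperating", "xyzzy"):
    print(term, "->", to_token(term))
```

Per the description, the same procedure would be repeated for each term of the query.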
Typically, nouns ending in "ness", "ing", and "bility" do not make very good canonical forms and are usually listed as alternate forms of more basic nouns. Unsuffixed forms are preferred. As shown in block 490 of FIG. 4, the nominalization process is repeated for each query term. The normalization processing 120 also includes processes to eliminate the case sensitivity problem, when appropriate. The content processing system 110 (FIG. 7) includes a lexicon 760. The lexicon 760 contains information (e.g., definitional characteristics) for a plurality of words. One definitional characteristic defines the part of speech for the corresponding word. For example, the lexicon 760 identifies whether a word is a common noun. Furthermore, the lexicon 760 identifies the amount of content carrying information for a corresponding word. In general, the normalization processing 120 utilizes the definitional characteristics in the lexicon to determine whether to generate a lower case term from an upper case term when input as a query term. In one embodiment, the normalization processing 120 generates lower case terms if the corresponding upper case term is both a common noun and a content carrying word. Names, which are proper nouns, are not converted. For query terms converted, both the upper case term and the lower case term are used to process the query. Although certain upper case terms are converted to lower case terms, the original upper case query term is considered more relevant to the original query than the lower case term. FIGS. 5a-5d illustrate an example of hierarchical query feedback in accordance with one embodiment of the present invention. For this example, a user inputs the query "Internet technology." Through normalization, either in normalization processing 120 or content processing 110, the query term "Internet technology" is flagged as a noun phrase (i.e., the terms are considered a single token).
The token "Internet technology" is mapped to the "Internet technology" node in the knowledge base as shown by the arrow coupling the phrase Internet technology to the node labeled "Internet technology" on FIG. 5a. To select a broader category, the term "computer networking" is selected since the node "computer networking" is a parent node to the token node "Internet technology", as shown in FIG. 5b. In one embodiment, a maximum of one broader feedback term is provided as hierarchical query feedback. To identify terminology that is narrower than the term "Internet technology", child categories of the token node "Internet technology" are selected. FIG. 5c illustrates a portion of a knowledge base that includes child nodes for the parent node "Internet technology." As shown in FIG. 5c, numerous child nodes are selected. To identify related terminology for the query "Internet technology", terminology cross-linked to the token node "Internet technology" is selected as related feedback terminology. FIG. 5d illustrates a portion of a knowledge base that includes terminology cross-linked to the "Internet technology" node. Table 10 lists the composite hierarchical query feedback terminology for the example query "Internet technology" shown in FIGS. 5a-5d.
TABLE 10
______________________________________
Feedback Type   Feedback Term                   Cardinality
______________________________________
Broader         computer networking             26
Narrower        A & W Internet                  4
Narrower        Alliance, Incorporated          32
Narrower        CO + RE                         1
Narrower        Communique, Incorporated        2
Narrower        CompuServe, Incorporated        6
Narrower        Gopher                          1
Narrower        IN, SEC                         278
Narrower        Internet                        4
Narrower        Jughead                         1
Narrower        Metropolitan Fiber Systems      1
Narrower        Mother, Incorporated            539
Narrower        NB*net                          384
Narrower        Prodigy Services Company        10
Narrower        Prospero                        4
Narrower        Spry, Incorporated              1
Narrower        Support Group, Incorporated     1344
Narrower        WWW - World Wide Web            2
Narrower        Yahoo                           1
Narrower        bulletin boards                 3
Narrower        electronic forms                1
Narrower        resource center                 2
Narrower        zNET                            15
Related         cyberculture                    12
Related         information technology          88
Related         library science                 83
Related         programming languages           15
Related         publishing industry             3302
______________________________________
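The selection of broader, narrower, and related feedback terms for the "Internet technology" example can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the `Node` record and the function `query_feedback` are hypothetical, only a few of the terms from Table 10 are included, and the cardinality values are copied from that table rather than computed from a document repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    parent: Optional[str] = None                          # broader term
    children: List[str] = field(default_factory=list)     # narrower terms
    cross_links: List[str] = field(default_factory=list)  # related terms


def query_feedback(
    kb: Dict[str, Node], cardinality: Dict[str, int], token: str
) -> List[Tuple[str, str, int]]:
    """Return (feedback type, feedback term, cardinality) rows for a token."""
    node = kb[token]
    rows: List[Tuple[str, str, int]] = []
    if node.parent is not None:  # at most one broader term in this sketch
        rows.append(("Broader", node.parent, cardinality.get(node.parent, 0)))
    rows += [("Narrower", c, cardinality.get(c, 0)) for c in node.children]
    rows += [("Related", r, cardinality.get(r, 0)) for r in node.cross_links]
    return rows


KB = {
    "Internet technology": Node(
        "Internet technology",
        parent="computer networking",
        children=["Gopher", "Yahoo", "bulletin boards"],
        cross_links=["cyberculture", "information technology"],
    )
}
CARDINALITY = {
    "computer networking": 26, "Gopher": 1, "Yahoo": 1,
    "bulletin boards": 3, "cyberculture": 12, "information technology": 88,
}

FEEDBACK = query_feedback(KB, CARDINALITY, "Internet technology")
for kind, term, count in FEEDBACK:
    print(f"{kind:<10}{term:<26}{count}")
```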
Table 10 includes three columns: feedback type, feedback term, and cardinality. For this embodiment, for each feedback term, a cardinality value is shown. The cardinality for a feedback term indicates the potential size of the hit-list retrieved if that term, input by itself, were the input query. For example, if the input query consisted of "computer networking", then 26 documents, identified from the document repository 130, would be retrieved for that input query. Knowledge Base: In general, the knowledge base 155 is the repository for all knowledge about languages and about the concrete and abstract worlds described by language in human discourse. The knowledge base 155 contains two major types of data: language specific data necessary to describe a language used for human discourse, and language independent data necessary to describe the meaning of human discourse. In general, in nominalization processing, given a term, the goal is to analyze and manipulate its language dependent features until a language independent ontological representation is found. The knowledge base 155 consists of concepts, general categories, and cross-references. Concepts, or detailed categories, are a subset of the canonical forms determined by the language dependent data. These concepts themselves are language independent. In different languages their text representations may be different; however, these terms represent the universal ontological location. Concepts are typically thought of as identification numbers that have potentially different representations in different languages. These representations are the particular canonical forms in those languages. General categories are themselves concepts, and have canonical form representations in each language. These categories have the additional property that other concepts and general categories can be associated with them to create a knowledge hierarchy. Cross references are links between general categories. 
These links augment the ancestry links that are generated by the associations that form a directed graph. The ontology in the knowledge base 155 contains only canonical nouns and noun phrases, and it is the normalization processing 120 that provides mappings from non-nouns and non-canonical nouns. The organization of the knowledge base 155 provides a world view of knowledge, and therefore the ontology actually contains only ideas of canonical nouns and noun phrases. The text representation of those ideas is different in each language, but the ontological location of the ideas in the knowledge base 155 remains the same for all languages. The organizational part of the knowledge base 155 is the structured category hierarchy comprised at the top level of general categories. These categories represent knowledge about how the world is organized. The hierarchy of general categories is a standard tree structure. In one embodiment, a depth limit of sixteen levels is maintained. The tree organization provides a comprehensive structure that permits augmentation of more detailed information. The tree structure results in a broad but shallow structure. The average depth from tree top to a leaf node is five, and the average number of children for non-leaf nodes is 4.5. There are two types of general categories: concrete and abstract. This distinction is an organizational one only and it has no functional ramifications. A concrete category is one that represents a real-world industry, field of study, place, technology or physical entity. The following are examples of concrete categories: "chemistry", "computer industry", "social identities", "Alabama", and "Cinema." An abstract category is one that represents a relationship, quality, feeling or measure that does not have an obvious physical real-world manifestation.
The following are examples of abstract categories: "downward motion", "stability", "stupidity, foolishness, fools", "mediation, pacification", "texture", and "shortness." Many language dependent canonical forms are mapped to the language independent concepts stored in the knowledge base 155. A concept is any idea found in the real world that can be classified or categorized as being closely associated with one and only one knowledge base 155 general category. Similarly, any canonical form in a particular language can map to one and only one concept. For example, there is a universal concept for the birds called "cranes" in English, and a universal concept for the machines called "cranes" in English. However, the canonical form "cranes" does not map to either concept in English due to its ambiguity. In another language, which may have two different canonical forms for these concepts, mapping may not be a problem. Similarly, if "cranes" is an unambiguous canonical form in another language, then no ambiguity is presented in mapping. Cross references are mappings between general categories that are not directly ancestrally related, but that are close to each other ontologically. Direct ancestral relationship means parent-child, grandparent-grandchild, great grandparent-great grandchild, etc. Cross references reflect a real-world relationship or common association between the two general categories involved. These relationships can usually be expressed by universal or majority quantification over one category. Examples of valid cross references and the relationships are shown in Table 11.
TABLE 11
______________________________________
oceans --> fish (all oceans have fish)
belief systems --> moral states (all belief systems address moral states)
electronics --> physics (all electronics deals with physics)
death and burial --> medical problems (most cases of death and burial are caused by medical problems)
______________________________________
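The two properties of cross references described here, namely that they are directed and that they connect only categories with no direct ancestral relationship, can be illustrated with a small Python sketch. The names `PARENT`, `CROSS_REFS`, `ancestors`, `is_valid_cross_reference`, and `cross_references_from` are hypothetical; the parent link reuses the "Internet technology" example and the cross references are those of Table 11.

```python
from typing import Dict, List, Set, Tuple

# child -> parent links of the category tree (a single link, for illustration)
PARENT: Dict[str, str] = {"Internet technology": "computer networking"}

# Directed cross references (source, target) from Table 11.
CROSS_REFS: Set[Tuple[str, str]] = {
    ("oceans", "fish"),
    ("belief systems", "moral states"),
    ("electronics", "physics"),
    ("death and burial", "medical problems"),
}


def ancestors(category: str) -> List[str]:
    """Walk parent links upward: parent, grandparent, and so on."""
    found = []
    while category in PARENT:
        category = PARENT[category]
        found.append(category)
    return found


def is_valid_cross_reference(source: str, target: str) -> bool:
    """A cross reference may not link directly ancestrally related categories."""
    if source == target:
        return False
    return target not in ancestors(source) and source not in ancestors(target)


def cross_references_from(category: str) -> List[str]:
    """Follow cross references in their stored direction only."""
    return sorted(t for s, t in CROSS_REFS if s == category)


print(cross_references_from("oceans"))  # ['fish']
print(cross_references_from("fish"))    # []
```

Storing each reference as an ordered pair keeps "oceans to fish" from implying "fish to oceans" unless the reverse pair is added explicitly.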
Cross references are not automatically bidirectional. For example, in the first entry of Table 11, although oceans are associated with fish, because all oceans have fish, the converse may not be true since not all fish live in oceans. The names for the general categories are chosen such that the cross references that involve those general categories are valid with the name or label choices. For example, if there is a word for fresh water fish in one language that is different than the word for saltwater fish, the oceans to fish cross reference is not valid if the wrong translation of fish is used. Although the knowledge base 155 is described as cross linking general categories, concepts may also be linked without deviating from the spirit and scope of the invention. FIG. 6 illustrates an example portion of a knowledge base augmented to include additional terminology as well as cross references and links among categories and terms. The classification hierarchy and notations shown in FIG. 6 illustrate an example that classifies a document on travel or tourism, and more specifically on traveling to France and visiting museums and places of interest. As shown in FIG. 6, the classification categories (e.g., knowledge catalog 150) contain two independent static ontologies, one ontology for "geography", and a second ontology for "leisure and recreation." The "geography" ontology includes categories for "political geography", "Europe", "Western Europe", and "France." The categories "arts and entertainment" and "tourism" are arranged under the high level category "leisure and recreation." The "visual arts" and the "art galleries and museums" are subcategories under the "arts and entertainment" category, and the category "places of interest" is a subcategory under the category "tourism." The knowledge base 155 is augmented to include linking and cross referencing among categories for which a linguistic, semantic, or usage association has been identified.
For the example illustrated in FIG. 6, the categories "France", "art galleries and museums", and "places of interest" are cross referenced and/or linked as indicated by the circles, which encompass the category names, as well as the lines and arrows. This linking and/or cross referencing indicates that the categories "art galleries and museums" and "places of interest" may appear in the context of "France." For this example, the knowledge base 155 indicates that the Louvre, a proper noun, is classified under the category "art galleries and museums", and further associates the term "Louvre" to the category "France." Similarly, the knowledge base 155 indicates that the term "Eiffel Tower" is classified under the category "places of interest", and is also associated with the category "France." The knowledge base 155 may be characterized, in part, as a directed graph. The directed graph provides information about the linguistic, semantic, or usage relationships among categories, concepts and terminology. The "links" or "cross references" on the directed graph, which indicate the associations, are graphically depicted in FIG. 6 using lines and arrows. For the example shown in FIG. 6, the directed graph indicates that there is a linguistic, semantic, or usage association among the concepts "France", "art galleries and museums", and "places of interest." Content Processing System: FIG. 7 is a block diagram illustrating one embodiment for a content processing system. In general, the content processing system 110 analyzes the document set 130 and generates the document theme vector 160. For this embodiment, the content processing system 110 includes a linguistic engine 700, a knowledge catalog processor 740, a theme vector processor 750, and a morphology section 770. The linguistic engine 700 receives, as input, the document set 130, and generates, as output, the structured output 710.
The linguistic engine 700, which includes a grammar parser and a theme parser, processes the document set 130 by analyzing the grammatical or contextual aspects of each document, as well as analyzing the stylistic and thematic attributes of each document. Specifically, the linguistic engine 700 generates, as part of the structured output 710, contextual tags 720, thematic tags 730, and stylistic tags 735 that characterize each document. Furthermore, the linguistic engine extracts topics and content carrying words 737, through use of the thematic tags 730, for each sentence in the documents. For a detailed description of the contextual and thematic tags, see U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995, that includes an Appendix D, entitled "Analysis Documentation." In one embodiment, the linguistic engine 700 generates the contextual tags 720 via a chaos loop processor. All words in a text have varying degrees of importance in the text, some carrying grammatical information, and others carrying the meaning and content of the text. In general, the chaos loop processor identifies, for words and phrases in the documents, grammatical aspects of the documents including identifying the various parts of speech. In order to accomplish this, the chaos loop processor ascertains how the words, clauses and phrases in a sentence relate to each other. By identifying the various parts of speech for words, clauses, and phrases for each sentence in the documents, the context of the documents is defined. The chaos loop processor stores information in the form of the contextual tags 720. U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995, includes an Appendix C, entitled "Chaos Processor for Text", that contains an explanation for generating contextual or grammatical tags. A theme parser within the linguistic engine 700 generates the thematic tags 730.
Each word carries thematic information that conveys the importance of the meaning and content of the documents. In general, the thematic tags 730 identify thematic content of the document set 130. Each word is discriminated in the text, identifying importance or meaning, the impact on different parts of the text, and the overall contribution to the content of the text. The thematic context of the text is determined in accordance with predetermined theme assessment criteria that is a function of the strategic importance of the discriminated words. The predetermined thematic assessment criteria defines which of the discriminated words are to be selected for each thematic analysis unit. The text is then output in a predetermined thematic format. For a further explanation of a theme parser, see Appendix E, entitled "Theme Parser for Text", of U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995. As shown in FIG. 7, the morphology section 770 contains the knowledge catalog 150 and a lexicon 760. In one embodiment, the knowledge catalog 150 identifies categories for the document themes. For this embodiment, the knowledge catalog 150 contains categories, arranged in a hierarchy, that reflect a world view of knowledge. Appendix A of U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995, which is herein expressly incorporated by reference, is an example of a knowledge catalog for use in classifying documents. Although the present invention is described in conjunction with a knowledge catalog used to classify documents, any classification criteria that identifies topics or categories may be used in conjunction with the present invention without deviating from the spirit or scope of the invention. In general, the lexicon 760 stores definitional characteristics for a plurality of words and terms. 
For example, the lexicon 760 defines whether a particular word is a noun, a verb, an adjective, etc. The linguistic engine 700 uses the definitional characteristics stored in the lexicon 760 to generate the contextual tags 720, thematic tags 730, and the stylistic tags 735. An example lexicon, for use with a content processing system, is described in Appendix B, entitled "Lexicon Documentation", of U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995. The topics and content carrying words 737 are input to the knowledge catalog processor 740. In part, the knowledge catalog processor 740 processes the content carrying words for direct use with the knowledge catalog 150 and knowledge base 155. Specifically, the knowledge catalog processor 740 generates, as appropriate, the nominal or noun form of each content carrying word, as well as the count sense and mass sense of the word. Furthermore, the knowledge catalog processor 740 determines, from the knowledge catalog 150, which content carrying words are non ambiguous. As shown in FIG. 7, the theme vector processor 750 receives the thematic tags 730 and contextual tags 720 from the structured output 710. In addition, the non ambiguous content carrying words from the knowledge catalog processor 740 are input to the theme vector processor 750. The content carrying words may include single words or phrases. The content carrying words output from the knowledge catalog processor 740 are converted to the noun or nominal form. In general, the theme vector processor 750 presents a thematic profile of the content of each document (e.g., generates the document theme vector 160, including classifying the documents in the knowledge catalog 150). To accomplish this, the theme vector processor 750 determines the relative importance of the non ambiguous content carrying words in the document set.
In one embodiment, the theme vector processor 750 generates a list of theme terms, including words and phrases, and assigns a relative theme strength to each theme term. The theme vector processor 750, through use of the knowledge catalog 150, generates a theme concept for each theme term by mapping the theme terms to categories in the knowledge catalog 150. Thus, the theme concepts indicate a general topic or category in the knowledge catalog 150 to identify the content of each document. In addition, the theme vector processor 750 generates, for each theme term, an importance number, a theme strength, and an overall capacity weight of collective content importance. Table 12 is an example document theme vector 160.
TABLE 12
______________________________________
Document Theme Vector
Document Themes   Theme Strength   Classification Category
______________________________________
Theme.sub.1       190              (category.sub.a)
Theme.sub.2       110              None
Theme.sub.3       70               (Category.sub.c)
Theme.sub.4       27               (Category.sub.d)
. . .             . . .            . . .
Theme.sub.n       8                (Category.sub.z)
______________________________________
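A document theme vector of the kind shown in Table 12 can be modeled as a list of entries ordered by theme strength, each with an optional classification category. The Python sketch below is illustrative only: the `ThemeEntry` record and `build_theme_vector` function are hypothetical names, the strengths are the sample values from Table 12, and the heuristics that actually compute theme strength are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThemeEntry:
    theme: str
    strength: int                   # relative importance to the document
    category: Optional[str] = None  # None when the theme is unclassified


def build_theme_vector(entries: List[ThemeEntry]) -> List[ThemeEntry]:
    """Order themes from most important to least important."""
    return sorted(entries, key=lambda e: e.strength, reverse=True)


VECTOR = build_theme_vector([
    ThemeEntry("theme_4", 27, "category_d"),
    ThemeEntry("theme_2", 110),
    ThemeEntry("theme_1", 190, "category_a"),
    ThemeEntry("theme_3", 70, "category_c"),
])

for entry in VECTOR:
    print(f"{entry.theme:<10}{entry.strength:<6}{entry.category or 'None'}")
```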
As shown in Table 12, a document theme vector 160 for a document includes a list of document themes, indicated in Table 12 by Theme.sub.1 -Theme.sub.n. Each theme has a corresponding theme strength. The theme strength is calculated in the theme vector processor 750. The theme strength is a relative measure of the importance of the theme to the overall content of the document. For this embodiment, the larger the theme strength, the more important the theme is to the overall content of the document. The document theme vector 160 lists the document themes from the most important to the least important themes (e.g., theme.sub.1 -theme.sub.n). The document theme vector 160 for each document further includes, for some themes, a category for which the theme is classified. The classification category is listed in the third column of the document theme vector shown in Table 12. For example, theme.sub.1 is classified in category.sub.a, and theme.sub.3 is classified in category.sub.c. In one embodiment, the theme vector processor 750 executes a plurality of heuristic routines to generate the theme strengths for each theme. U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", contains source code to generate the theme strengths in accordance with one embodiment for theme vector processing. Also, a further explanation of generating a thematic profile is contained in U.S. Pat. No. 5,694,523, inventor Kelly Wical, entitled "Content Processing System for Discourse", filed May 31, 1995, which is herein incorporated by reference. Computer System: FIG. 8 illustrates a high level block diagram of a general purpose computer system in which the information retrieval system of the present invention may be implemented. A computer system 1000 contains a processor unit 1005, main memory 1010, and an interconnect bus 1025.
The processor unit 1005 may contain a single microprocessor, or may contain a plurality of microprocessors for configuring the computer system 1000 as a multi-processor system. The main memory 1010 stores, in part, instructions and data for execution by the processor unit 1005. If the information retrieval system of the present invention is wholly or partially implemented in software, the main memory 1010 stores the executable code when in operation. The main memory 1010 may include banks of dynamic random access memory (DRAM) as well as high speed cache memory.
The computer system 1000 further includes a mass storage device 1020, peripheral device(s) 1030, portable storage medium drive(s) 1040, input control device(s) 1070, a graphics subsystem 1050, and an output display 1060. For purposes of simplicity, all components in the computer system 1000 are shown in FIG. 8 as being connected via the bus 1025. However, the components of the computer system 1000 may be connected through one or more data transport means. For example, the processor unit 1005 and the main memory 1010 may be connected via a local microprocessor bus, and the mass storage device 1020, peripheral device(s) 1030, portable storage medium drive(s) 1040, and graphics subsystem 1050 may be connected via one or more input/output (I/O) busses. The mass storage device 1020, which may be implemented with a magnetic disk drive or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by the processor unit 1005. In the software embodiment, the mass storage device 1020 stores the information retrieval system software for loading to the main memory 1010.
The portable storage medium drive 1040 operates in conjunction with a portable non-volatile storage medium, such as a floppy disk or a compact disc read only memory (CD-ROM), to input and output data and code to and from the computer system 1000.
In one embodiment, the information retrieval system software is stored on such a portable medium, and is input to the computer system 1000 via the portable storage medium drive 1040. The peripheral device(s) 1030 may include any type of computer support device, such as an input/output (I/O) interface, to add functionality to the computer system 1000. For example, the peripheral device(s) 1030 may include a network interface card for interfacing the computer system 1000 to a network. For the software implementation, the documents may be input to the computer system 1000 via a portable storage medium or a network for processing by the information retrieval system.
The input control device(s) 1070 provide a portion of the user interface for a user of the computer system 1000. The input control device(s) 1070 may include an alphanumeric keypad for inputting alphanumeric and other key information, and a cursor control device, such as a mouse, a trackball, a stylus, or cursor direction keys. In order to display textual and graphical information, the computer system 1000 contains the graphics subsystem 1050 and the output display 1060. The output display 1060 may include a cathode ray tube (CRT) display or a liquid crystal display (LCD). The graphics subsystem 1050 receives textual and graphical information, and processes the information for output to the output display 1060. The components contained in the computer system 1000 are those typically found in general purpose computer systems, and in fact, these components are intended to represent a broad category of such computer components that are well known in the art.
The information retrieval system may be implemented in either hardware or software. For the software implementation, the information retrieval system is software that includes a plurality of computer executable instructions for implementation on a general purpose computer system.
Prior to loading into a general purpose computer system, the information retrieval system software may reside as encoded information on a computer readable medium, such as a magnetic floppy disk, magnetic tape, or compact disc read only memory (CD-ROM). In one hardware implementation, the information retrieval system may comprise a dedicated processor including processor instructions for performing the functions described herein. Circuits may also be developed to perform the functions described herein. The knowledge catalog 150 and knowledge database 155 may be implemented as a database stored in memory for use by the information retrieval system.
Although the present invention has been described in terms of specific exemplary embodiments, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention.
