Methods, apparatus, and data structures for annotating a database design schema and/or indexing annotations6999963Abstract An authoring tool (or process) to facilitate the performance of an annotation function and an indexing function. The annotation function may generate informational annotations and word annotations to a database design schema (e.g., an entity-relationship diagram or "ERD"). The indexing function may analyze the words of the annotations by classifying the words in accordance with a concordance and dictionary, and assign a normalized weight to each word of each of the annotations based on the classification(s) of the word(s) of the annotation. A query translator (or query translation process) to (i) accept a natural language query from a user interface process, (ii) convert the natural language query to a formal command query (e.g., an SQL query) using the indexed annotations generated by the authoring tool and the database design schema, and (iii) present the formal command query to a database management process for interrogating the relational database. Claims What is claimed is: Description COPYRIGHT NOTICE
Although not shown in the relations, each restaurant may have other attributes such as a star rating (e.g., *, **, ****, or *****), a cost rating (e.g., $,$$, $$$, $$$$, or $$$$$) and special options (e.g., Good Deal, Child Friendly, New, Romantic, 24-Hour, Afternoon Tea, Brunch, Delivery, Late Night, Live Entertainment, Noteworthy Wine List, Outdoor Seating, Pre-Theater Menu, Prix Fixe, Smoke Free, Smoke Friendly, View, etc.) In the relation 600, a neighborhood ID number is associated with a particular neighborhood and the person/place ID number is associated with a person or place. For example, neighborhood ID number 14 corresponds to the "Financial District" neighborhood of New York City. The following table lists exemplary New York City neighborhoods and associated ID numbers.
Having briefly described relational databases and associated terminology, an exemplary database design scheme is briefly discussed. Entity relation diagrams (or "ERDs") provide a semantic model of data in a database and are often used in database design. Semantic modeling permits a database to (i) respond more intelligently to user interactions, and (ii) support more sophisticated user interfaces. ERDs were introduced in the paper, Peter Pin-Shan Chen, "The Entity Relationship Model-Toward a Unified View of Data," International Conference on Very Large Data Bases, Framingham, Mass., (Sep. 22-24, 1975), reprinted in Readings in Database Systems, Second Edition, pp. 741-754, edited by in Michael Stonebraker, Morgan Kaufmann Publishers, Inc., San Francisco, Calif. (1994) (hereafter referred to as "the Chen paper"). Basically, the Chen paper defines an "entity" as a thing that can be distinctly identified. A "weak entity" is defined as an entity whose existence depends on some other entity. An entity may have a "property" or an "attribute" which draws its value from a corresponding value set. A "relationship" is an association among entities. Entities involved in a given relationship are "participants" in that relationship. The number of participating entities in a relationship defines the "degree" of the relationship. In entity relationship diagrams defined in accordance with the Chen paper, entities are depicted with rectangles, properties are depicted with ellipses, and relationships are depicted with diamonds. Exemplary entity relationship diagrams are shown in FIGS. 7 and 8. FIG. 7 depicts an exemplary entity relationship diagram 700 of a restaurant database. As shown, the "restaurant" entity has "rating", "cuisine type" and "special option" attributes or properties. As denoted by the "n:1" between the restaurant entity and its rating attribute, each restaurant has only one rating, though more than one restaurant may have the same rating. As denoted by the "n:m" between the restaurant entity and its cuisine type attribute, each restaurant may have more than one cuisine type, and more than one restaurant may offer the same cuisine type. Similarly, as denoted by the "n:m" between the restaurant entity and its special options attribute, each restaurant may have more than one special option, and more than one restaurant may have the same special option. Further, as shown in FIG. 7, the restaurant and cost entities are participants in a "has a" relationship. As depicted by the "n:1" of the "has a" relationship, each restaurant has only one cost, but more than one restaurant may have the same cost. FIG. 8 depicts an exemplary entity relationship diagram 800 of a neighborhood database. As shown, the "person/place" entity has a "neighborhood" attribute or property. As denoted by the "n:m" between the person/place entity and its neighborhood attribute, each person/place may have more than one neighborhood, and more than one person/place may be in the same neighborhood. For example, there may be many McDonalds restaurants throughout various neighborhoods in New York City. Tools exist to semantically design databases and/or the extract semantic information from an existing database. For example, InfoModeler, now part of Visio 2000 from Microsoft Corporation of Bellevue, Washington uses ER or object role modeling for designing, optimizing, or re-engineering databases. Also, the ERDwin product can be used to generate databases from ERDs and vice-versa. Since different terms are used in the relational database and ERD vernacular, and since similar terms may have different meanings in the different contexts (e.g., "relation" and "relationship"), in the following description, terms should be interpreted as follows, unless indicated otherwise:
Having reviewed relational databases and semantic database design information (e.g., ERDs), a system of the present invention will now be described with reference to FIG. 4. The database design schema (e.g., an ERD) 420 is provided to an annotation authoring process 440. A design schema, such as that disclosed in U.S. patent application Ser. No. 09/325,166, entitled "METHODS, APPARATUS AND DATA STRUCTURES FOR PROVIDING A UNIFORM REPRESENTATION OF VARIOUS TYPES OF INFORMATION", filed on Jun. 3, 1999 and listing Edward Jung (incorporated herein by reference) may be used. The annotation authoring process (or more generally, an annotating facility) 440 uses automated annotation rules 442 and/or user annotations received from a user interface process (or more generally, a user interface) 430 to generate annotations to the database design schema (e.g., ERD) 420. An indexing process (or more generally, an indexing facility) 445 analyzes the words of the annotations, using a dictionary and a concordance 446, to enhance the understanding of natural language queries. The indexing process 445 produces indexed annotations 460. Other, non-indexed annotations 460′, such as probabilities and/or entities 462′ associated with tables, rows, and/or columns, may also be generated by the annotation authoring process 440. However, these annotations 460′ are not indexed. Each of the annotations may include a word 462 and an associated table, row, column, and/or expression 464. Both the annotation authoring process 440 and the indexing process 445 (collectively referred to as "a database authoring process" or "authoring tool") are described in § 4.3 below. Once the database authoring process is complete and indexed annotations 460 are available, the system 400 may process natural language queries accepted via user interface process 430. The natural language query is provided to a query translation process (or more generally, a query translator) 450 which uses the indexed annotations 460 and the database design schema 420 (and the dictionary and/or concordance 446) to generate formal query command(s) (e.g., structured query language (or "SQL") query commands). The formal query command(s) is then provided to a database management process (or more generally, a database manager) 470 which uses the formal query command(s) to interrogate the relational database 410. The results to the formal query command(s) are provided to the database management process 470, which forwards such results to a user interface process 430 for presentation to the user. The query translation process 450 is described in § 4.4 below. § 4.3 Authoring Tool An exemplary authoring tool, which may include an annotation/authoring process 440 and an indexing process 445, is discussed below. More specifically, functions which may be carried out by the exemplary authoring tool are introduced in § 4.3.1. A structure of the exemplary authoring tool is described in § 4.3.2. Finally, examples for illustrating some of the operations of the exemplary authoring tool are described in § 4.3.3. § 4.3.1 Functions of the Authoring Tool In this section, the basic functions which may be performed by the exemplary authoring tool will be briefly described. Basically, the authoring tool may perform an annotation function and an indexing function. The annotation function generates informational annotations and word annotations to the database design schema (e.g., an ERD) 420. Note that semantic information related to the design of a database differs from the semantic (or syntactic) information related to grammar or linguistics used in interpreting natural language query interfaces. Informational annotations (i) distinguish tables corresponding to entities and those corresponding to properties (or attributes) in the database 410, (ii) attach, to rows of the tables, a probability that the row will be referenced, and/or (iii) describe entities in a way that is meaningful to humans. Word annotations attach related words to tables, rows, columns, or relationships of the database design schema. The indexing function analyzes the words of automatically generated annotations by classifying the words in accordance with a concordance and dictionary, and assigning a normalized weight to each word of each of the annotations based on the classification(s) of the word(s) of the annotation. In the exemplary embodiment disclosed here, manually generated annotations must completely match a natural language query. That is, in the exemplary embodiment disclosed here, manually generated annotations have a weight of 1.0 for the phrase in its entirety. Manual and automatic annotations are treated differently because it is believed that manually generated annotations will be more precise and be the product of more thought, while automatically generated annotations have more opportunity for imprecision. Naturally, the requirement for exact matching for manually generated annotations may merely be a default parameter that may be changed such that exact matching is not necessary. In this case, manually generated annotations will be similarly analyzed and provided with a normalized weight. § 4.3.2 Structures/Methodologies of the Exemplary Authoring Tool Having introduced various functions which may be performed by the authoring tool, structures and methodologies of the authoring tool will now be described. Recall that the authoring tool basically includes an annotation/authoring process 440 and an indexing process 445. An exemplary structure or methodology of the annotation/authoring process 440 will be described in § 4.3.2.1 below. Then, an exemplary structure or methodology of the indexing process 445 will be described in § 4.3.2.2 below. § 4.3.2.1 Exemplary Structure/Methodology For the Annotation/Authoring Process Recall that the annotation/authoring process 440 uses user inputs from a user interface process 430 and automated annotation rules 442 to generate annotations to a database design schema (e.g., an ERD) 420. FIG. 10 is a flow diagram of an annotation/authoring process 440′. First, as shown in step 1010, informational annotations are requested and/or accepted from the user (e.g., via a user interface process 430 which may include an ERD editor, a keyboard, and a video monitor). Basically, three (3) types of informational annotations may be requested and/or accepted; namely, entity annotations, prior annotations, and description annotations. Entity annotations distinguish tables corresponding to entities, and those corresponding to properties (or attributes). Entities may be thought of as nouns while properties may be thought of as adjectives (or attributes of the entity). Thus, for example, since a movie has a rating, "movie" would be an entity and "rating" would be an property. Prior annotations are attached to each row of each table and represent the probability that a reference will be made to that row with respect to all other rows in the table. The automated annotation rules 442 may instruct the annotation/authoring process 440 to automatically determine uniform probabilities for all rows within a table that do not have explicit prior annotations. The uniform probabilities are determined from probabilities associated with the tables. These prior annotations (probabilities) may be updated based on actual usage data. For example, priors can be assigned to a row (e.g., row 37 in restaurants=0.0017) or tables (e.g., P(reference to any restaurant row)=0.37). If a table has a given prior, row priors can be subtracted from the table prior. Then, the difference can be uniformly divided over the other rows. Description annotations describe entities in a way that is meaningful to humans. For example, a description annotation may denote the "name" column in the restaurant table. This is a human readable way to refer to a specific restaurant in the table. Returning to FIG. 10, after informational annotations have been requested and/or accepted, word annotations are requested and/or accepted as shown in step 1020. Word annotations are attached to tables, rows, columns, or relations between entities. Word annotations may be divided into two (2) types; namely manual and automatic. Manual word annotations are created by a human author. A dictionary or thesaurus can be used to suggest words to use as annotations. For example, the word restaurant has a synonym of "eating house" and a hypernym of "building". A lexicographic database, such as WordNet, created by Princeton University of Princeton, N.J., for example, may be used to automate this process, at least to some extent. For example, lexicographer files in Wordnet organize nouns, verbs, adjectives, and adverbs into synonym groups, and describe relations between synonym groups. Next, as shown in step 1030, automatic word annotations are automatically generated from the database 410 in accordance with the automated annotation rules 442. More specifically, information contained in columns/rows may be used to automatically annotate those columns/rows. For example, in a table having a column with movie names, each row may be labeled with a movie name. Processing then continues via return node 1040 § 4.3.2.2 Exemplary Structure/Methodology of the Indexing Process FIG. 11A is a high level flow diagram of an exemplary indexing process 445′. Indexing is based on the words of an annotation. Since "entity" and "prior" "informational" annotations are typically one word (or number), these annotations are not indexed. Rather, automatic word annotations based on the values of rows, as well as manual word annotations, may be indexed. First, as shown in step 1110, the words of the annotations are classified. This classification facilitates a more sophisticated recognition since some words are more important than others in a natural language query. In general, it has been recognized that the more rare or distinct a word is, the more important it is for purposes of searching a database. For example, the name of a restaurant may be "Azteca Mexican Family Restaurant". The queries "Azteca", "Azteca Restaurant", "Azteca Family Restaurant" or "Azteca Mexican Restaurant" should generate a "match" to this particular restaurant. On the other hand, it may be desired that the queries "Mexican Restaurant" or "Mexican Family Restaurant" should not (only) generate a "match" to this particular restaurant. This is because "Azteca" is a much more unique or distinct word than "Mexican", which is itself more unique or distinct than "Family" or "Restaurant". During the classification step, each word may be classified into one (1) of eight (8) classes; namely, unique, proper, class, stop, rare, infrequent, frequent, and normal (or common). Referring back to FIG. 4, recall that the indexing process 445 uses a dictionary and concordance 446 (such as WordNet developed by Princeton University, or MindNet developed by Microsoft Corporation). Unique words are either possessives (e.g., "Bob's") or words that are not in the dictionary 446 (e.g., "Azteca"). Proper words are those defined as proper nouns in the dictionary. Class words are defined as words that have been used to annotate all rows in a table. Stop words (e.g., "the", "what") are defined as words that are so common, that they have little semantic meaning for identifying entities. The remaining classifications, rare, infrequent, frequent, and normal, are derived from the frequency of the word and its sense in the concordance 446. The concordance may be built and/or updated from actual queries. However, the concordance will generally be derived from a large sample of text. The words of each annotation to be indexed are then provided with normalized weights as shown in step 1120. FIG. 11B is a flow diagram of an exemplary process 1120′ for providing normalized weights (e.g., between 0.00 and 1.00) to words of an annotation. First, as shown in step 1130, it is determined whether the annotation contains any words classified as "unique". If, as shown in steps 1132 and 1134, there is more than one "unique" word in the annotation, each of the "unique" words is assigned a normalized weight of 0.75 divided by the number of "unique" words in the annotation. If, on the other hand, as shown in steps 1132 and 1136, there is only one "unique" word in the annotation, that word is assigned a normalized weight of 0.50. The process continues at step 1140 in which the remaining weight (i.e., 1.00 if there are no unique words in the annotation, 0.50 if there is one unique word in the annotation, and 0.25 if there is more than one unique word in the annotation) is apportioned to the remaining words of the annotation. For example, each "class" word gets 1 share of the remaining weight, each "stop" word gets 1 share of the remaining weight, each "common" (or normal) word gets 2 shares of the remaining weight, each "frequent" word gets 3 shares of the remaining weight, each "infrequent" word gets 4 shares of the remaining weight, each "rare" word gets 5 shares of the remaining weight, and each "proper" word also gets 5 shares of the remaining weight. Thus, the weight of a (non-unique) word may be expressed as: ##EQU1## ##EQU2## Finally, step 1150 limits the normalized weight assigned to stop, class, and common (or normal) words to 0.10. Processing continues via return node 1160. Referring back to FIG. 11A, once the normalized weights are determined for words of descriptions, processing continues via return node 1125. FIG. 9A and the following discussion provide a brief, general description of an exemplary apparatus in which at least some aspects (e.g., at least some of the described processes) of the present invention may be implemented. The present invention will be described in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. However, the methods of the present invention may be effected by (and the data structures of the present invention may be stored on) other apparatus. Program modules may include routines, programs, objects, components, data structures, etc. that perform a task(s) or implement particular abstract data types. Moreover, those skilled in the art will appreciate that at least some aspects of the present invention may be practiced with other configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, set-top boxes, mainframe computers, and the like. At least some aspects of the present invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote memory storage devices. With reference to FIG. 9A, an exemplary apparatus 900 for implementing at least some aspects of the present invention includes a general purpose computing device in the form of a conventional personal computer 920. The personal computer 920 may include a processing unit(s) 921, a system memory 922, and a system bus 923 that couples various system components including the system memory 922 to the processing unit 921. The system bus 923 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory may include read only memory (ROM) 924 and/or random access memory (RAM) 925. A basic input/output system 926 (BIOS), containing basic routines that help to transfer information between elements within the personal computer 920, such as during start-up, may be stored in ROM 924. The personal computer 920 may also include a hard disk drive 927 for reading from and writing to a hard disk, (not shown), a magnetic disk drive 928 for reading from or writing to a (e.g., removable) magnetic disk 929, and an optical disk drive 930 for reading from or writing to a removable (magneto-) optical disk 931 such as a compact disk or other (magneto-) optical media. The hard disk drive 927, magnetic disk drive 928, and (magneto-) optical disk drive 930 may be coupled with the system bus 923 by a hard disk drive interface 932, a magnetic disk drive interface 933, and a (magneto-) optical drive interface 934, respectively. The drives and their associated storage media provide nonvolatile storage of machine readable instructions, data structures, program modules and other data for the personal computer 920. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 929 and a removable optical disk 931, those skilled in the art will appreciate that other types of storage media (with appropriate interface devices), such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may be used instead of, or in addition to, the storage devices introduced above. A number of program modules may be stored on the hard disk drive 927, magnetic disk 929, (magneto-) optical disk 931, ROM 924 or RAM 925, such as an operating system 935, one or more application programs 936, other program modules 937, and/or program data 938 for example. A user may enter commands and information into the personal computer 920 through input devices, such as a keyboard 940 and pointing device 942 (e.g., a mouse) for example. Other input devices (not shown), such as a microphone, a joystick, a game pad, a satellite dish, a scanner, or the like, may also (or alternatively) be included. These and other input devices are often connected to the processing unit 921 through a serial port interface 946 coupled to the system bus 923. However, input devices may be connected by other interfaces, such as a parallel port, a game port or a universal serial bus (USB). A monitor 947 or other type of display device may also be connected to the system bus 923 via an interface, such as a video adapter 948 for example. In addition to (or instead of) the monitor 947, the personal computer 920 may include other (peripheral) output devices (not shown), such as speakers and printers for example. The personal computer 920 may operate in a networked environment which defines logical and/or physical connections to one or more remote computers, such as a remote computer 949. The remote computer 949 may be another personal computer, a server, a router, a network computer, a peer device or other common network node, and may include many or all of the elements described above relative to the personal computer 920. The logical and/or physical connections depicted in FIG. 9A include a local area network (LAN) 951 and a wide area network (WAN) 952. An intranet and the Internet may be used instead of, or in addition to, such networks. When used in a LAN, the personal computer 920 may be connected to the LAN 951 through a network interface adapter (or "NIC") 953. When used in a WAN, such as the Internet, the personal computer 920 may include a modem 954 or other means for establishing communications over the wide area network 952. The modem 954, which may be internal or external, may be connected to the system bus 923 via the serial port interface 946. In a networked environment, at least some of the program modules depicted relative to the personal computer 920 may be stored in the remote memory storage device. The network connections shown are exemplary and other means of establishing a communications link between the computers may be used. FIG. 9B is a more general machine 900 which may effect one or more of the processes discussed above. The machine 900 basically includes a processor(s) 902, an input/output interface unit(s) 904, a storage device(s) 906, and a system bus or network 908 for facilitating the communication of information among the coupled elements. The processor(s) 902 may execute machine-executable instructions to effect one or more aspects of the present invention. At least a portion of the machine executable instructions may be stored (temporarily or more permanently) on the storage device(s) 906 and/or may be received from an external source via an input interface unit 904. Having now described exemplary structures of both the annotation/authoring process 440 and the indexing process 445 of the authoring tool, an operational example of the authoring tool is now provided in § 4.3.3 below. § 4.3.3 Operations of the Authoring Tool An example for illustrating the operation of the annotation/authoring process 440 of the authoring tool will be described in § 4.3.3.1 below. Then, an example for illustrating the operation of the indexing process 445 of the authoring tool will be described in § 4.3.3.2 below. § 4.3.3.1 Exemplary Operation of the Annotation/Authoring Process The operation of the annotation/authoring process 440, in the context of a database of movies and restaurants, will now be described with reference to FIGS. 10 and 14 and the attached Appendix A. FIG. 14 is an entity relationship diagram (ERD) of a database having tables with information related to restaurants and movies. The attached Appendix A is a file of a Prolog definition of a database substantially similar to the ERD of FIG. 14, which has been annotated in accordance with the present invention. Portions of Appendix A are reprinted throughout this discussion to facilitate the discussion of the exemplary annotation/authoring process 440. Examples of informational annotations will be presented in § 4.3.3.1.1. Thereafter, examples of word annotations will be discussed in § 4.3.3.1.2. § 4.3.3.1.1 Informational Annotations Recall that three (3) types of informational annotations may be made to the database design schema 420; namely entity annotations, prior annotations, and description annotations. Examples of each of these informational annotations will now be described with reference to the following two (2) tables (defined in Prolog), as well as commands from Appendix A:
In the first table, the first line "%% Restaurant" is merely a comment line. The second line:
In the second table, the first line
Recall that entity annotations distinguish entities and properties. In general, entities may correspond to nouns while properties may correspond to adjectives (or attributes of entities), though this distinction is not necessary. In the tables described above, the "is a" relation, <==>, will infer that the table is an entity. On the other hand, the "has an attribute" relation, <==, as well as the description line, desc, will infer that the table is a property. Thus, tables may be identified, automatically, as either entities or properties based on the foregoing. However, there are some database schemas where these assumptions would not necessarily be true. In such cases, tables should be explicitly (e.g., manually) identified as entities or properties. Finally, recall that prior annotations attach to each row of a table, a probability that the row will be referenced with respect to all possible rows. A prior annotation from Appendix A is reprinted below:
§ 4.3.3.1.2 Word Annotations Examples of word annotations are set forth below with reference to FIG. 14 and selected lines from Appendix A. First, examples of word annotations to tables (defined in Prolog) from Appendix A are reprinted below:
Examples of word annotations to columns from Appendix A are reprinted below:
Examples of word annotations to variables from Appendix A are reprinted below:
Automatic word annotations are made to the tables based on information in the database. For example, the PhoneType table 1430 is automatically annotated with information from the database from its rows. In the example depicted in FIG. 14, the PhoneType table 1430 is automatically annotated with the words "Business", "Fax", "Reservation", etc. § 4.3.3.2 Operation of the Indexing Process Referring to FIGS. 11A and 11B, assume that the restaurant name "Toronto's Maple-liscious All You Can Eat Flap-Jack House" is a value in a restaurant name column in a table of a database. Recall that the automatic word annotation will annotate the row with its contents. Assume further that "Toronto's" is a proper word, "Maple-licious" is a unique word, "All" is a stop word, "You" is a common word", "Can" is a stop word, "Eat" is a common word, "Flapjack" is a rare word, and "House" is a frequent word. Referring to steps 1130, 1132, and 1136 of FIG. 11B, since "Maple-liscious" is the only unique word of the annotation, it would be assigned a normalized weight of 0.50. The remaining weight, 0.50, will be apportioned among the remaining words. The sum of all of the shares of the remaining words is 17 (i.e., 5+1+2+1+2+5+3). Accordingly, the normalized weights assigned to "Toronto's" and "Flap-Jack" would each be 0.15 (i.e., 5/17*0.50), the normalized weight assigned to "House" would be 0.09 (i.e., 3/17*0.50), the normalized weights assigned to "You" and "Eat" would each be 0.06 (i.e., 2/17*0.50), and the normalized weights assigned to "All" and "Can" would each be 0.03 (i.e., 1/17*0.50). Having described the function, structure and operations of the authoring tool, the functions, structure, and operations of the query translator is now described in § 4.4 below. § 4.4 Query Translator An exemplary query translator is described below. More specifically, functions which may be performed by the query translator are introduced in § 4.4.1. Then, a structure of the exemplary query translator is described in § 4.4.2. Finally, an exemplary operation of the exemplary query translator is set forth in § 4.4.3. § 4.4.1. Functions of the Query Translator Referring back to FIG. 4, the query translator (or query translation process 450) may function to (i) accept a natural language query from a user interface process 430, (ii) convert the natural language query to a formal command query (e.g., an SQL query) using the indexed annotations 460 (and other annotations 460′) generated by the authoring tool and the database design schema 420, and (iii) present the formal command query to the database management process 470 for interrogating the relational database 410. As will become apparent in the following description, the conversion function is the main function of the query translator. A structure and methodology of an exemplary query translator is presented below. § 4.4.2 Structure/Methodology of the Exemplary Query Translator FIG. 12 is a high level flow diagram of an exemplary query translation process 450′. Generally, the natural language query accepted from the user interface process 430 is an alphanumeric string. For example, if the user interface process 450 accepts inputs from a keyboard or keypad, the user enters the alphanumeric string by manipulating the keys of the keyboard or keypad. If the user interface process 450 accepts inputs from a pressure sensitive tablet and character recognition means, the user enters the alphanumeric string by writing letters and/or numbers on the tablet. If the user interface process accepts inputs from a microphone and speech recognition means, the user enters the alphanumeric string by speaking into the microphone. Naturally, other types of user interfaces may be used to accept a natural language query. Referring first to step 1210 of FIG. 12, the accepted string is parsed. Details of an exemplary string parser are described in § 4.4.2.1 below with reference to FIG. 13. Then, as shown in step 1220, phrases determined from the parsed string are checked against the stored indexed annotations to determine whether any of the phrases "match" any of the indexed annotations. Since, as shown in FIG. 4, the records of the indexed annotations associate a word 462 with a table, row, column, or expression of the relational database 410, when a "match" is determined, a selected one of the associated data (also referred to as a "fragment" or as a "pattern") 464 is returned. As shown in step 1225, for each word or phrase of the parsed string, a clique (or set) of pattern objects (or a group of rank ordered fragments) is generated. Details of an exemplary matching (and fragment ranking) facility are described in § 4.4.2.2 below with reference to FIG. 15. In step 1230, the patterns of the cliques (or fragments from the groups of rank ordered fragments) are "combined" (or "chained") together, based on the database design schema 420, to generate a formal command query. Details of a chaining process (or more generally, a chaining facility) are described in § 4.4.2.3 below with reference to FIGS. 16 through 23. Finally, as shown in step 1240, objects of the chained query are marked in accordance with a presentation process (or facility). Processing continues via return node 1250. In this way, a natural language query is converted to a formal command query. FIG. 17 is a diagram of processes that may be used by the query translation process 450 of FIG. 4. A string parsing process (or more generally, a string parser) 1210′ accepts a natural language query from the user interface process 430, and provides a parsed string to a word or phrase-annotation match determination process (or more generally, a match determination facility) 1220′. The word or phrase-annotation match determination process 1220′ determines matches between words and/or phrases of the parsed string and indexed annotations 460. The word or phrase-annotation match determination process 1220′ may also use the dictionary and concordance 446 as will be explained later. The matching indexed annotations have associated patterns (or fragments) which may include one or more objects (e.g., tables, columns, rows, relationship, and/or expression) of the relational database 410 or the database design schema 420. In one exemplary embodiment, for each word or phrase of the parsed query, associated fragments may be rank ordered by the fragment ranking process (or more generally, the fragment ranking facility) 1225′. The groups of rank ordered fragments are then provided to a fragment chaining process (or more generally, the fragment chainer) 1230′ which attempts to chain the groups of rank ordered fragments based on the database design schema 420. Alternatively, the patterns generated from the word or phrase match determination process 1220′ may be provided to the optimized (pattern) combination process 1230′. Finally, certain objects of the combined patterns (or chained fragments) are marked for presentation by an object marking process (or more generally, a presentation facility) 1240′. The marked objects of the combined patterns (or chained fragments) are converted to a formal query command(s) and presented to a database management process 470 for interrogation of the relational database 410. Referring back to FIG. 9B, the processor(s) 902 may effect the processes of the query translation process 450 by executing machine-executable instructions. At least a portion of the machine-executable instructions may be stored on the storage device(s) 906 and/or may be received from an external source via an input interface 904. § 4.4.2.1 String Parser FIG. 13 is a flow diagram of an exemplary process 1210′ for parsing an alphanumeric string accepted by the query translator 450′. Although, as shown in optional alternative step 1310a, punctuation in the input string may be considered, all punctuation is removed (or ignored) in our exemplary parser as shown in step 1310b (except for identifying a possessive such as "Bob's" for example). Similarly, although, as shown in optional alternative step 1320a, capitalization may also be considered, case is normalized (or ignored) in our exemplary parser as shown in step 1320b. In this way, a user does not need to be concerned with whether their query will be case sensitive or whether their query will require punctuation. It is believed that a natural language interface made and operated in accordance with the present invention will be robust enough to operate well without the semantic information which may be conveyed by case or punctuation. This robustness is particularly important if the user interface process 430 employs speech recognition because case and punctuation information might not exist. Finally, as shown in step 1330, the parsed words may be stemmed so that only their roots are left. Processing continues via return node 1340. § 4.4.2.2 Phrase Matcher/Ranker FIG. 15, which includes FIGS. 15A and 15B, is a flow diagram of an exemplary process 1220′/1225′ for generating groups of rank ordered fragments 464 (or simply sets or cliques of pattern objects) associated with the database design schema 420 based on "matches" between phrases in the parsed string and the indexed annotations 460. The exemplary process 1220′/1225′ flows as follows. First, as shown in step 1502, values are initialized. In the following, PWORD will refer to a particular word of the parsed string, ANNOT will refer to a particular annotation in the indexed annotations 460, AWORDANNOT will refer to a particular word of the particular annotation (ANNOT), FRAGSET will be a set of fragments (or pattern objects), and ORDEREDFRAGSET will be a set or group of rank ordered fragments from FRAGSET. Thus, step 1502 may initialize these terms by initializing PWORD to the first (e.g., leftmost) word of the parsed string, initializing ANNOT to the first word of the indexed annotations 460, initializing AWORDANNOT to the first word of the annotation ANNOT, initializing FRAGSET as an empty set and initializing ORDEREDFRAGSET as an empty set. Next, as shown in decision step 1504, it is determined whether PWORD is the same as AWORDANNOT. Note that, in each case, the words may be the root (or stem) of the words. If PWORD is not the same as AWORDANNOT, processing branches to decision step 1506. As shown in steps 1506 and 1508, if the annotation has another word, then the annotation word AWORDANNOT is set to that next word and processing continues at decision step 1504. Referring back to decision step 1506, if there are no more annotation words AWORDANNOT in the particular annotation ANNOT, then processing branches to decision step 1510. As shown in steps 1510 and 1512, if there are other annotations remaining in the indexed annotations, then the annotation ANNOT is set to the next of the remaining annotations and the annotation word AWORDANNOT is set to the first word of the new annotation ANNOT. Processing the continues, once again, at decision step 1504. Referring back to decision step 1510, if, on the other hand, there are no remaining annotations in the indexed annotations 460, then processing branches to decision step 1513. Decision step 1513 determines whether the dictionary 446 indicates that the current word PWORD of the parsed string is a noun or adjective (or an "open" class word). If PWORD is a noun or adjective, it is assumed that PWORD is important and, as shown in step 1514, an "unknown word" error is generated and presented to the user. Processing then proceeds to decision step 1540. If, on the other hand, PWORD is not a noun or adjective, processing continues at decision step 1540. The following example illustrates the importance of steps 1513 and 1514. If unrecognized words (i.e., words that do not match any annotations) were merely ignored, misleading query results could be generated. For example, if the query "restaurants with food from Chad" were entered and if Chad was not recognized (i.e., if Chad does not match any of the annotations), the result would return all restaurants in the database 410 and their associated cuisine types. As a further example of what would occur if steps 1513 and 1514 were not performed, the query "Chad restaurant" would return all restaurants without their cuisine type. In this case, the user might incorrectly assume that all of the restaurants serve food from Chad. On the other hand, recall that only unrecognized nouns and adjectives generate error messages. Otherwise, if every unrecognized word generated an error message, users would likely become frustrated by too many spurious error messages. Finally, as shown in steps 1540 and 1542, if the parsed string has another remaining word, then the word of the parsed string (PWORD) is set to the next remaining word (NEXT PWORD) ANNOT is reset to the first annotation, and AWORDANNOT is reset to the first word of the annotation ANNOT. Processing then continues at decision step 1504. Otherwise, (matching nodes D in FIGS. 15A and 15B) as shown in step 1550, any fragments in the set may be ordered in accordance with a ranking methodology and processing continues via return node 1560. In the embodiment of the optimized combination process discussed in § 4.4.2.3, the cliques (or sets) of pattern objects (or fragments) do not need to be rank ordered. Accordingly, any steps related to such ranking need not be performed in that embodiment. Returning back to decision step 1504, if the current word of the parsed string (PWORD) is the same as the current word of the current annotation (AWORDANNOT) then processing continues at step 1520. At step 1520, the current word of the parsed string (PWORD) will be set to the next word (if one exists) of the parsed string. If there are no more words in the parsed string, then the word of the parsed string (PWORD) is set to an end of parsed string message (PEND). Similarly, at step 1520, the current word AWORDANNOT of the current annotation (ANNOT) is set the next word (if one exists) of the current annotation (ANNOT). If there are no more words in the current annotation (ANNOT), then the word (AWORDANNOT) of the current annotation (ANNOT) is set to an end of annotation message (AEND). Processing then continues with decision step 1522. As shown in decision step 1522, if both the word of the parsed string is the last word and the word of the current annotation is the last word (i.e., if both PWORD is set to PEND and AWORDANNOT is set to AEND), then (matching nodes A in FIGS. 15A and 15B) processing branches to decision step 1528 which determines whether or not the phrase and the annotation match completely or only partially. More specifically, as shown in decision step 1528, if the first word of the phrase is the same as the first word of the annotation, then processing branches to step 1534. Otherwise, processing branches to decision step 1530. Returning to decision step 1522 of FIG. 15A, if PWORD is not set to PEND or AWORDANNOT is not set to AEND, then processing continues at decision step 1524. As shown in decision step 1524, if either the word of the parsed string is the last word or the word of the current annotation is the last word (i.e., if either PWORD is set to PEND or AWORDANNOT is set to AEND), then (matching nodes B in FIGS. 15A and 15B) processing branches to decision step 1530. At decision step 1530, it is determined whether the current annotation (ANNOT) was made manually or automatically. This determination is made because manual annotations should match the phrase exactly while partial matches between automatic annotations and the phrase may suffice if some conditions are met. Manual and automatic annotations are treated differently because it is believed that manually generated annotations will be more precise and be the product or more thought, while automatically generated annotations have more opportunity for imprecision. Naturally, the requirement for exact matching for manually generated annotations may merely be a default parameter that may be changed such that exact matching is not necessary. Similarly, the fact that partial matches may suffice for automatically generated annotations may merely be a default parameter that may be changed such that exact matches are required. If, the current annotation (ANNOT) was not generated automatically (i.e., if it was generated manually), then (matching node E in FIGS. 15B and 15A) processing branches to decision step 1540 which was discussed above. This is because the phrase of the parsed string and the current annotation do not match exactly. If, on the other hand, the current annotation (ANNOT) was generated automatically, then processing branches to decision step 1532. Decision step 1532 checks to see whether the phrase meets certain criteria related to the degree of the match or the confidence of the match. An exemplary set of criteria is set forth below. A partial match is considered valid if (i) (a) all non-class words in the annotation are present or (b) the sum of all the normalized word weights is at least 0.50, and (ii) if there are any unique words in the parsed string, at least one unique word is present. Other criteria may be used to judge partial matches. As shown in FIG. 15B, if the criteria are met, processing branches to step 1534 where a fragment associated with the annotation (ANNOT) is added to the set of fragments (FRAGSET). If, on the other hand, the criteria are not met, (matching nodes E in FIGS. 15B and 15A) processing branches to decision step 1540. Returning now to decision step 1524 of FIG. 15A, if the current word of the parsed string is not the last word of the parsed string (PWORD≠PEND) and the current word of the current annotation is not the last word of the annotation (AWORDANNOT≠AEND), then (matching nodes C in FIGS. 15A and 15B) processing branches to decision step 1526. If the current word of the parsed string (PWORD) is the same as the current word of the annotation (AWORDANNOT), then (matching nodes F in FIGS. 15B and 15A) processing branches to step 1520 to determine whether the adjacent words also match. If, on the other hand, the current word of the parsed string (PWORD) is not the same as the current word of the annotation (AWORDANNOT), then processing branches to decision step 1530 since there is only a partial match. After all of the words of the parsed string are checked against all of the words of all of the annotations, a set of fragments (FRAGSET) is analyzed. Naturally, the set of fragments may be empty if step 1534 was never reached. However, assuming that there are some fragments in the set, the fragments may be rank ordered as shown in step 1550. More specifically, as shown in step 1550, for each PWORD or phrase from the parsed query, a group of the fragments may be rank ordered based on a ranking criteria and the rank ordered fragment(s) are added to the set ORDEREDFRAGSET. For example, in the query "Russian Room", the word "Russian" may completely match an annotation to a particular row of a CuisineType column of a Cuisine table and may partially match an annotation "Russian Tea Room" to a particular row of a Restaurant column of a Restaurant table. Similarly, the phrase "Russian Room" may partially match an annotation "Russian Tea Room" to the particular row of the Restaurant column of the Restaurant table. Finally, the word "Room" may partially match a number of annotations to particular rows of the Restaurant column of the Restaurant table and a Movie column of a Movie table for example. The selection operation selects one of these fragments for each "matching" (partially or completely) word or phrase of the natural language query. One exemplary set of ranking criteria (rules) is described below. First, complete matches are ranked ahead of partial matches. Second, assuming a complete match, the fragments associated with the matches are ranked: (i) first from the highest number of words in the match to the lowest; and (ii) then from the highest prior (probability) associated with the objects to the lowest. Assuming a partial match, the fragments associated with the matches are ranked: (i) first from the highest product of the normalized weight and the prior (probability) for object references to the lowest; (ii) second from the highest number of words in the match to the lowest; (iii) third from the lowest number of matching entities to the highest; and (iv) finally from the highest normalized weight to the lowest. For each word or phrase of the parsed query, the fragments are rank ordered in accordance with the above criteria. To reiterate, in the embodiment of the optimized combination process described in § 4.4.2.3 below, the cliques (or sets) or pattern objects (or fragments) do not need to be rank ordered. However, each initial pattern may be assigned a cost. In addition, an empty pattern having an associated cost of ignoring the clique to which it belongs may be provided in some or all of the cliques. In the database schema, there may be relations that have "free text" (or "blurb") entities as their destinations. If at least part of a natural language query can be a "blurb", the blurb entity (pattern object) may appear across multiple (adjacent) cliques, just as a phrase pattern object may appear across multiple (adjacent) cliques. § 4.4.2.3 Combining Fragments (Patterns) As discussed in § 4.4.2.2 above, a phrase matcher may be used to generate groups of fragments (also referred to as "patterns"), each group corresponding to a word of the parsed string. If a pattern corresponds to a phrase (i.e., more than one word) of the parsed string, the pattern may appear across multiple groups. Each group of pattern objects may be referred to as a "clique". Although the pattern objects may be rank ordered within each clique, for example as described in § 4.4.2.2 above, they need not be so ranked. One aspect of the present invention may function to combine these pattern objects to generate a formal query. Before describing the optimized (pattern object) combination process 1230′, certain terms, which may be used below, are defined. A pattern may be a set of entity and relation variables. Each entity or relation variable of a nonempty pattern may have a distinct type. In addition each relation variable connects a source and a destination entity variable. It is worthwhile thinking of a pattern as a directed graph, in which entity variables are vertices and relation variables are edges. Each vertex and edge may be labeled with a type symbol. The source and destination of a directed edge is the source and the destination of the relation variable. The graph corresponding to a pattern is connected, meaning that any entity or relation variable can be reached from any other entity or relation variable by going through the source and destinations of relation variables. There is one exception—a pattern may also be a single relation variable with unspecified source and destination (also referred to as a "singleton relation"). The corresponding graph for such a pattern would be ill formed—an edge for which the source and destination do not exist but nevertheless, it is considered connected. Although not part of the combination process, it should be noted that patterns with this structure are typically used for graph matching. Consider a directed graph in which each vertex and each edge is labeled with a type, as for the patterns. Both vertices and edges could have additional properties. A pattern "matches" a graph if one can construct a one-to-one correspondence between entity variables and vertices and between relation variables and edges, such that the label on each vertex and edge is the same as that of the corresponding variable and such that for each relation variable, its source and destination are the source and destination vertices of the corresponding edge. Empty patterns, defined later, match anything. An entity variable that is an element of some pattern P is denoted by T(X), where T is the entity type name and X is a symbol that is unique in P. Similarly a relation that is an element of P is denoted by T(X, S, D), where T is the relation type name, X is a symbol that is unique in P, S is the symbol for the source entity and D is the symbol for the destination entity. When a symbol is not significant, it may be denoted as '_'. A pattern may be denoted by an expression [T1, . . . , Tk] where each Ti, 1≦i≦k, denotes an element of the pattern as above (the order is not significant). For example, a pattern consisting of a single entity variable of type employee could be denoted by
In addition, a pattern may have a restriction, which is presently a logical statement built from conjunction, disjunction and negation connectives, and atomic statements that are applications of predicates (equality, etc) to properties of entity variables and constants. Two examples of patterns with restrictions are:
A pattern object may include: (1) a unique identifier, which will be assumed to be a natural number; (2) a pattern, as described above; (3) a cost, which is a nonnegative number; and (4) a set of identifiers of ancestor pattern objects. The identifier uniquely identifies a pattern object so when the meaning is obvious, a pattern object, its identifier and its pattern may be discussed as if they were the same. A pattern object with an empty pattern is called a void pattern object. For an input pattern object, a cost (described in more detail later) is given. For a pattern constructed during an optimized combination process, the cost may be computed as described below. In one exemplary embodiment, costs of input patterns are in the range 1 to 10. The ancestor set reflects the set of patterns that were used to construct the pattern. For an input pattern, the ancestor set contains its only own identifier. For other patterns, the computation of the ancestor set is described below. (Sometimes, patterns of identical structure, but that were built by connecting different initial patterns and that may thus have different costs, may be constructed. The ancestor set gives sufficient information to correctly distinguish such patterns and compute their costs correctly.) A clique is a set of patterns objects, which should be thought of exclusive alternatives. As described below, an optimized combination process chooses exactly one pattern from each clique and joins the chosen patterns. Thus, a clique can be thought of as alternative but contradictory interpretations. A clique may contain at most one void pattern object. Choosing the void pattern object of a clique effectively ignores the clique (and the word of the parsed phrase with which the clique is associated) but does incur the cost of the void pattern object. Thus, referring to FIG. 16, the phrase matcher may generate data 1610 which includes cliques 1612 of pattern objects 1614. Each clique 1612 may have a different number of pattern objects 1614. For example, clique 1 1612a may have n pattern objects 1614 while clique x 1612b may have m pattern objects 1614. Further, each pattern object 1614 may include an associated cost 1616. This cost may correspond to the ranking aspect of the phrase matcher described in § 4.4.2.2 above. For example, this cost may be based, at least in part, on a degree to which a string of the parsed query "matched" the annotation associated with the pattern of the pattern object, or a word class (e.g., adverb, adjective, verb, noun, proper noun, etc.) of the string or the annotation associated with the pattern of the pattern object. Some of the cliques 1612 may have a cost 1616 for ignoring its pattern objects 1614. (Recall "void pattern objects".) This cost of ignoring may be based on a word class of the string with which the clique 1612 is associated (e.g., the cost of ignoring an adverb or an adjective may be 1, the cost of ignoring a verb may be 3, the cost of ignoring a noun may be 5, and the cost of ignoring a proper noun may be infinite). As shown in FIG. 16, an exemplary optimized combination process (or, more generally, a combination process) 1230′ may function to accept the cliques 1612 of pattern objects 1614 and a database schema 420 (See, e.g., FIG. 14.) and to generate a connected pattern consistent with the schema 420 and covering a selection of pattern objects 1614, one from each of the cliques 1612. As is further shown in FIG. 16, the exemplary optimized combination process 1230′ may consider hints 1620 when determining how to best combine patterns from the cliques 1612. The hints 1620 may include, for example, source hints 1622, destination hints 1624, proximity hints 1626, and possessive hints 1628. For example, in the queries:
As an example of proximity hints 1626, consider the query:
As an example of possessive hints 1628, consider the query:
Thus, hint sets may be generated from a syntactic analysis of the natural language query. Different hint sets may apply to different interpretations of the same natural language query. Each hint set will be coherent. For example, from the phrase:
Referring still to FIG. 16, notice that a path cost precomputation process 1630 may be used to generate entity-to-entity costs or, alternatively, nexus-to-nexus, entity-to-closest nexus, and entity-to-entity exception costs from the database schema 420. The precomputation process 1630 may be used to ease computational burdens on the optimized combination process 1230′ during runtime, as will become apparent below. Having introduced a context in which the optimized combine process 1230′ may be used, an exemplary method 1230" for effecting the optimized combine process 1230′ is now described with reference to FIG. 18. FIG. 18 is a high level flow diagram of an exemplary method 1230" which may be used to effect the optimized combine process 1230′. As will become apparent, the exemplary optimized combination method 1230" can be thought of a "best first" searching method in which costs of various states, each state corresponding to pattern objects, at least some of which are combined, are estimated and updated. After each update, the most promising (e.g., lowest estimated cost) state can be further pursued. If an updated estimated cost of the most promising state turns out to be greater than the estimated cost of a previous state, that previous state can be pursued. More specifically, referring to FIG. 18, the cliques 1612 of pattern objects 1614 are accepted (hint sets 1620 may also be accepted) as shown in block 1805. Then, as shown in block 1810, a pattern object 1614 is selected from each clique 1612 to generate a start state. Further, as shown in optional decision block 1820 and block 1825, all possible start states may be determined by selecting all possible combinations of one pattern object 1614 from each of the cliques 1612. At this point, all possible start states have been generated. As shown in block 1830, a cost heuristic may be applied to each of the start states (or more generally, "states") based on the database schema 420. An exemplary cost heuristic will be described in more detail in § 4.4.2.3.1.2.2 below. Next, as shown in block 1835, a best (e.g., lowest estimated cost) state is selected. Then, in decision block 1840, it is determined whether or not a final solution has been found. This determination may be based on determining whether the state is a single connected pattern. If a final solution has not been found, the method branches to block 1845. In block 1845, successor states of the selected state are determined. Basically, such successor states may be generated by combining (e.g., by unification or joining described in § 4.4.2.3.1.1 below) patterns of the selected state. For example, if the selected state has a number K of patterns, the successor states should each have K-1 patterns. Alternatively, more than two patterns can be combined, in which case the each of the successor states would have less than K patterns. The number of successor states may be pruned as will be described in § 4.4.2.3.1 below. As shown in block 1850, a cost heuristic, based, for example, on the database schema 420, may be applied to each of the successor states. Then an estimated cost may be applied to each of the successor states as shown in block 1855. The estimated cost may be a function of an actual (or known) cost and the heuristic (or unknown) cost. Finally, as shown in block 1860, the successor states (that have not been pruned) may be added to a state queue. Alternative methods for performing blocks 1845, 1850, 1855 and 1860 are described in § 4.4.2.3.1 with reference to FIGS. 19 and 20. After block 1860, the method branches back to block 1835 where a new best (e.g., lowest estimated cost) state is selected. Thus, the method 1230" is iterative and pursues a state that is estimated to provide the best (e.g., lowest cost) solution (i.e., a single connected pattern). Returning to decision block 1840, if it is determined that a final solution has been found (e.g., the state has a single connected pattern), then the method 1230" branches to block 1865 where a final cost is computed. As will become more apparent in § 4.4.2.3.1.2 below, the final cost may be adjusted to include any entity or relation types in the final pattern that are not type compatible with entity or relation types in the original patterns (i.e., those patterns in the cliques). Such costs are tracked but are not used when determining the definitive or known heuristic cost component of the estimated cost since it is not known whether or not these entities (and their associated costs) will remain. The method 1230" may use a predetermined time limit to determine more than one solution. Alternatively, there may be two (2) timeouts in the method 1230. First, if no solution is found within a first predetermined time T1 (where T1 may be infinite), the method "gives up". Second, once a first solution is found, the method can continue the search for a second predetermined time T2 (where T2 may be 0). In this alternative embodiment, the first predetermined time T1 should be checked on each round of the loop (e.g., just before decision block 1835), and not just when a solution is found. Thus, in this alternative, an initial time allotment of T1 is provided and after a first solution has been found, the method gets another time allotment of T2 (from the time of the first solution). As will be appreciated from the description in § 4.4.2.3.1 below, this is because some assumptions in pruning successor states, which are very useful in reducing runtime computations, may, in some relatively rare instances, cause a non-optimal solution to be generated. Therefore, if time permits (for example, if a user would not perceive or be unduly annoyed by the extra time), additional solutions may be found as indicated by decision block 1875. More specifically, the solution state and its final cost may be saved as shown in block 1870. Actually, if there is already a better (e.g., lower cost) solution state, the present solution state need not be saved. Then, at decision block 1875, if a time out period has not expired, the solution state is removed from the state queue as shown in block 1880 and the method 1230" continues at block 1835. (Although not shown, once the known costs of a state exceed the total cost of a solution, that state should no longer be pursued since it cannot be an optimal solution.) If, on the other hand, the time out period has expired, the solution with the best final cost is returned as shown in block 1885 and the method 1230" is left via RETURN node 1890. There may be the case that step 1885 finds that no solution has been saved. In such a case, the translator may revert to generating an information retrieval style of query, searching for words occurring in the input in predetermined columns. Having described one way to effect the optimized combination process 1230′, at least at a high level, exemplary methods for generating and determining costs of successor states are now described in § 4.4.2.3.1 below with reference to FIGS. 19 and 20. § 4.4.2.3.1 Generating and Determining Costs of "Successor" States Recall from blocks 1845, 1850, 1855 and 1860 of FIG. 18 that given a selected state, successor states are generated, cost heuristics are applied to each of the successor states, an estimated cost is applied to each of the successor states and the successor states are added to a state queue. FIGS. 19 and 20 are high level flow diagrams of alternative methods for effecting these acts. Each will be described in turn. FIG. 19 is a high level flow diagram of an exemplary method 1845′/1850′/1855′/1860′ for effecting the foregoing acts. For each pair of patterns in the selected stated, a number of acts are performed as shown within loop 1905-1955. First, as shown in block 1910, for the given pair of patterns of the selected state, actions (for combining) the patterns and the associated costs of such actions are determined. Actions for combining patterns may include a "unify entity" action described in § 4.4.2.3.1.1.1 below, a "unify relationship" action described in § 4.4.2.3.1.1.2 below, and a "join" (or link) pattern objects action described in § 4.4.2.3.1.1.3 below. As will be described in more detail below, in one exemplary embodiment, the cost for a unify entity or unify relation action is zero, and the cost of a join (or link) pattern objects action is based on the cost of relationships (e.g., a cost of 2 for each relationship) and entities (e.g., a cost of 1 for each entity) added to the pattern. Next, as shown in optional block 1915, these actions may be pruned. Pruning serves to reduce computations. In one exemplary embodiment, for a given pair of patterns in the selected state, only the lowest cost action(s) and actions within a predetermined range of (e.g., within 1.15 times) the lowest cost action(s) are kept—the rest are deleted. In practice, the present inventors have found that such limited pruning advantageously reduces later computations (i.e., computational complexity) while still permitting a globally optimal solution to be found. Recall from FIG. 18 that more than one solution may be found. Referring to loop 1920-1950, for each of the remaining (un-pruned) actions, a number of acts are performed. As shown in block 1925, for each remaining action, a new state having the combined selected patterns is created. Then, as shown in block 1925, a definitive (or known) cost component of an estimated cost associated with the newly created state is updated. More specifically, the definitive (or known) cost may be defined as the definitive costs of the parent plus the cost of the action. If the action was a join (or link) pattern objects, the cost of the action can ignore costs of entities and relationships in the "path" or "chain" connecting the selected patterns that are type compatible with entities and relationships in the original pattern objects (i.e., the pattern objects 1614 selected from the cliques 1612). This is because such entities and relationships, and their associated costs, may drop out later. Hence, the cost of such entities and relationships is unknown. As shown in block 1935, a heuristic cost associated with the newly generated state may be determined. Exemplary ways of effecting this act are described in § 4.4.2.3.1.2.2 below. Then, as shown in block 1940, an estimated cost associated with the state, which is based on the definitive (or known) cost and the heuristic (or unknown) costs, is determined. In one exemplary embodiment, the estimated cost may be the sum of the definitive costs and the heuristic costs. The newly generated state (and its associated estimated cost) may then be added to a state queue as shown in block 1945. After all remaining actions are processed with in the loop 1920-1950, a next pair of patterns in the selected state are determined and processed in the loop 1905-1955. After all pairs of patterns in the selected state are processed, the method 1845′/1850′/1855′/1860′ is left via RETURN node 1960. FIG. 20 is a high level flow diagram of an alternative exemplary method 1845"/1850"/1855"/1860" for effecting the associated acts in FIG. 18. Basically, this alternative exemplary method uses a different pruning scheme than that used in the exemplary method illustrated in FIG. 19. Referring now to FIG. 20, at decision block 2005, it is determined whether or not any patterns of the selected state have one or more "nexus". A nexus may be defined as an entity which is the source of or destination to a relatively large number of relationships. Nexuses may be manually selected or identified, may be defined as "having" greater than a predetermined number of relationships, or may be defined as being within a top predetermined percentile of entities "having" relationships. If any patterns of the selected state have one or more nexus, a pattern(s) of these patterns with a minimum cost is selected as shown in block 2015. Otherwise, if none of the patterns of the selected state have a nexus, a minimum cost pattern of all of the patterns is selected as shown in block 2010. The present inventors believe that this pruning technique may be useful because it favors pursuing patterns with one or more nexus first. Such patterns are pursued first since the inventors believe that the presence of one or more nexus in the pattern will permit the other patterns of the selected state to be added with less expense later. In any event, referring once again to FIG. 20, the acts within the loop 2020-2055 are performed for each pair of patterns in the selected state, where each such pair of patterns includes the pattern(s) selected in block 2010 or 2015. For a given pair of patterns, as shown in block 2025, actions (for combining) the patterns and the associated costs of such actions are determined. To reiterate, actions for combining patterns may include a "unify entity" action described in § 4.4.2.3.1.1.1 below, a "unify relationship" action described in § 4.4.2.3.1.1.2 below, and a "join" (or link) pattern objects action described in § 4.4.2.3.1.1.3 below. As will be described in more detail below, in one exemplary embodiment, the cost for a unify entity or unify relation action may be zero, and the cost of a join (or link) pattern objects action may be based on the cost of relationships (e.g., cost of 2 for each entity) and entities (e.g., cost of 1 for each entity) added to the pattern. Then, as shown in block 2030, a new state having the combined selected patterns is created. Then, as shown in block 2035, a definitive (or known) cost component of an estimated cost associated with the newly created state is updated. To reiterate, the definitive (or known) cost may be defined as the definitive costs of the parent state plus the cost of the action. If the action was a join (or link) pattern objects, the cost of the action should ignore costs of entities and relationships in the "path" or "chain" connecting the selected patterns that are type compatible with entities and relationships in the original pattern objects (i.e., the pattern objects 1614 selected from the cliques 1612). As shown in block 2040, a heuristic cost associated with the newly generated state may be determined. Exemplary ways of effecting this act are described in § 4.4.2.3.1.2.2 below. Then, as shown in block 2045, an estimated cost associated with the state, which is based on the definitive (or known) cost and the heuristic (or unknown) costs, is determined. To reiterate, in one exemplary embodiment, the estimated cost may be the sum of the definitive costs and the heuristic costs. The newly generated state (and its associated estimated cost) may then be added to a state queue as shown in block 2050. After all pairs of patterns, including a selected pattern, of the selected state are processed in the loop 2020-2055, the method 1845"/1850"/1855"/1860" is left via RETURN node 2060. Recall from FIGS. 19 and 20 that for a given pair of patterns, actions for joining those patterns, and their costs, may be determined. Exemplary ways to combine pattern objects are now described in § 4.4.2.3.1.1 below. § 4.4.2.3.1.1 Ways to Combine Patterns (Fragments) As introduced in § 4.4.2.3.1 above with reference to FIGS. 19 and 20, pattern objects may be joined by unifying entities in the pattern objects, unifying relationships in the pattern objects, or joining pattern objects (e.g., via a path including entities and relationships that is consistent with the database schema 420). Basically, each pattern (fragment) is a collection of (one or more) objects (e.g., at least a part of entity tables or property tables) that are related to one another in accordance with the database design schema 420 (e.g., the ERD). All patterns have a "type". For example, referring to the Prolog file of Exhibit A, each column of each table has a defined type. In the exemplary ERD of Exhibit A, the types include: addressType, admissionFlag, cinema, cinemaType, city, cuisineType, date, entity, entityFlag, floating point, genre, hoursType, integer, movie, movieFlag, neighborhood, neighborhoodID, parkingtype, paymenttype, personPlace, personPlaceFlag, phonetype, price, quality, rating, reservation, restaurant, restaurantflag, stars, starsID, string, time. As is evident from Exhibit A, different columns of different tables can be of the same type. Each object may be either bound (i.e., confined to a particular set of rows of its type) or unbound (not confined). As shown in step 1230, patterns (or fragments) of the cliques (or groups) 1612 are "combined" so that a formal command query may be generated for interpretation by the database management process 470. Basically, the combination process combines the patterns in each of the groups (or cliques) in a way consistent with the annotated database design schema 420 (e.g., the annotated ERD) and in a least costly way. Once the pattern objects 1614 from each of the cliques 1612 are combined, the resulting connected pattern can be easily converted to a formal command query. Unifying entities in patterns is described in § 4.4.2.3.1.1.1 below. Then, unifying relationships in patterns is described in § 4.4.2.3.1.1.2 below. Finally, joining patterns (e.g., via a path that is consistent with the database schema 420) is described in 4.4.2.3.1.1.3 below. § 4.4.2.3.1.1.1 Unifying Entities of Patterns Consider two patterns P1 and P2, where P1 contains an entity of type T1 with symbol v1 and P2 contains an entity of type T2 with symbol v2, such that T1 and T2 are unifiable (e.g., min (T1, T2) exists, which in turn means that either T1 and T2 are the same, or one is (transitively) a subtype of the other). P1 and P2 can then be connected by unification. The result of connecting the patterns is then a single pattern in which the two terms of types T1 and T2 have been merged to a single term of type min (T1, T2). For example, the two patterns
If P1 and P2 are connected by unification of a pair of entity variables T1(v1) in P1 and T2(v2) in P2, then the pattern of P is that expressed by (P1\T1(v1))[v/v1]\(P2\T2(v2))[v/v2]\{T(v)}, where v is a new symbol and T is min(T1, T2). (E[v/w]means the expression resulting from replacing with v each free occurrence of w in E.) Potential generalizations of this form of unification include unifying three or more patterns at the same time, unifying patterns on more than one variable, and unifying two variables in the same pattern. An alternative method may unify complete patterns. More specifically, if two (2) patterns are unified, all objects in those patterns are unified, thereby combining the two (2) patterns to define a single pattern. To put it another way, to unify two (2) patterns, there must be at least one (1) pair of objects (i.e., at least a part of an entity table or a property table) that can be coerced to refer to the same object. More specifically, two (2) objects can be coerced to refer to the same object if both objects are of the same type or of a more general type. The following is an example of unifying two (2) patterns having objects of the same type. In the query "I'd like to eat Mexican food", the word "eat" matches annotations to patterns, including the pattern having the object "cuisineType(variable—1)". In addition, the word "Mexican" matches annotations to patterns, including the pattern having the object "cuisineType(Mexican)". Finally, the word "food" matches annotations to patterns, including the pattern having the object "cuisineType(variable—2)". Since these patterns contain objects (in this case, entities) of the same type, namely, cuisineType, they are combined to form the unified pattern having the object (in this case, entity) cuisineType(Mexican). The following is an example of unifying patterns where a first pattern has an object which is of a more general type and the second pattern has an object which is of a more specific type (i.e., the two (2) objects belong to tables related by an "IS A" relation). In the query "Restaurants named Gabriel's", the word "restaurant" matches annotations to patterns, including the pattern having the object "restaurantID(variable—1)". Further, the word "Gabriel's" matches annotations to patterns, including the pattern having the object "personPlaceID(n)", where "n" is a number associated with "Gabriel's". Since the personPlaceID table is a supertype of the restaurantID table, i.e.,
Two (2) objects that are unbound (i.e., have unspecified or variable values; not constrained to particular a row(s)) can be coerced into a single unbound object. For example, in the query "Eat food", the word "eat" matches annotations having patterns, including the pattern having the object cuisineType(variable—1). Further the word "food" matches annotations having patterns, including the pattern having the object cuisineType(variable—2). Since the object types are the same (tables having a "IS A" relation can also be unified in this way), the variables are set to a single variable. That is, the patterns, when unified, form a pattern having the ob | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
