Dictionary

Method and apparatus for efficient morphological text analysis using a high-level language for compact specification of inflectional paradigms

5475587

Abstract

A language syntax defines statement forms for statements that describe the inflectional morphology of the natural language. A set of language statements that follow a language syntax to describe the inflectional morphology of the natural language is accepted as input into to a computer. Rule statements define a set of morpho-syntactic features corresponding to grammatical distinctions within the parts of speech categories in the natural language and define a set of inflectional morphological paradigms. The inflectional morphological paradigms include form rule statements to describe the construction of word forms and associate with each construction pre-selected morpho-syntactic features.


Claims

What is claimed is:

1. For use in computer-based morphological text analysis of natural languages, a computer implemented method for creating a data structure for computer-based generation and recognition of word forms in a natural language, the computer implemented method comprising the steps of:

a. providing a morphological description of a natural language, the description comprising statements in a morphological description language, the morphological description language comprising statements arranged according to a pre-determined syntax, the syntax permitting the specification of inflectional morphologic paradigms, the morphologic paradigms comprising form rules including surface form rules and intermediate form rules, the form rules comprising a left-hand-side identifier and a right-hand-side specifying a word stem and, optionally, the concatenation or removal of an affix, including a prefix or a suffix, the stems comprising the identifiers of other form rules or form sets, or a keyword, said keyword being either a keyword LEX or a keyword NIL, the affixes comprising strings of characters or the identifier or an affix variable, the syntax capable of specifying that the form rules of one morphological paradigm are inherited by another morphological paradigm, the syntax permitting the stem in a form rule to be an indicator to a string in a lexicon, the syntax permitting the stem in a form rule to be an indicator that the form rule is not used in the given paradigm via the keyword NIL, the syntax permitting a form set identifier to represent a plurality of left-hand-side form rule identifiers and the form set identifier to be used as the stem in the right-hand-side of a form rule, the syntax permitting an affix variable to identify a set of affix strings with the affix variable being used as an affix in a right-hand-side of a form rule, said morphological description stored in a memory device;

b. disambiguating the stem components of the right-hand-sides of the form rules in each paradigm, the disambiguation process comprising the steps of:

i. determining in each form rule whether the stem component is an identifier of another form rule;

ii. replacing each stem component that is an identifier with a link to the identified form rule;

iii. determining in each form rule whether the stem component is an identifier in a form set;

c. determining for each paradigm whether there is a declaration stating that the paradigm inherits the form rules of another parent paradigm;

d. creating form rules for the paradigms that will inherit the form rule from a parent paradigm by sharing references to the form rules of the parent paradigm;

e. replacing, for each form rule that contains a right-hand-side reference to a form set, the form rule with a set of form rules, one for each form in the corresponding form set, each created form rule corresponding to the form set rule containing the right-hand-side reference to the form set;

f. checking each surface form for cycles, the cycle check process comprising the steps of:

i. creating a cycle check list initialized to empty;

ii. locating a surface form rule;

iii. checking stem components on the right-hand-side to determine if the stem is an identifier to another form rule;

iv. comparing the stem that is an identifier of another form rule to the entries on the cycle check list;

v. adding the stem that is an identifier to the cycle check list unless the identifier is included in the cycle check list;

vi. checking the form rule referenced by the identifier for cycles;

g. providing a set of orthographic rules; and

h. conflating the set of orthographic rules, the process of conflation comprising the steps of:

i. finding the set of form rules that match one of the orthographic rules in terms of an operator, an affix and an affix type;

ii. creating an inner form rule variant, the form rule variant comprising the stem form rule from the right-hand-side of the matching form rule as the right-hand-side stem and as the affix, an affix sequence comprising character strings and string variables, indicating the correct context determined by the orthographic rule, and as the operator a minus; and

iii. creating an outer form rule variant, the outer form rule variant comprising the newly created outer form rule as the right-hand-side stem and as the affix, an affix sequence comprising character strings and string variables, indicating the correct spelling as determined by the orthographic rule and as the operator a plus.

2. For use in computer-based morphological analysis of natural language text, a computer implemented method for generating surface forms of a word, the computer-implemented method comprising the steps of:

a. providing a lexicon, the lexicon comprising a plurality of word entries, the word entries containing data on each word in a language, including a citation form of the word and a paradigm, with which each lemma is associated, said lexicon stored in a memory device;

b. providing a computer-manipulatable data structure containing information concerning the grammatical form construction rules, paradigms and orthographic rules of a natural language, the form rules comprising a left-hand-side (LHS) comprising an identifier and a right-hand-side (RHS) comprising a word stem, an operator, and an affix, the operator indicating an operation to concatenate or remove the affix, the word stem comprising a reference to another form rule, a LEX indicator to signal that the form string is stored in the lexicon, or a NIL indicator to signal that the form is not used in the current paradigm, each form rule having associated with said form rule a set of pre-selected morpho-syntactic features, each form rule capable of being referenced in pre-selected ones of the grammatical construction rules, the orthographic rules being comprised of variables showing the context and result of the rules applicable, with the variables associated with a set of values over which the variables may range, where each orthographic rule applies there being an outer variant form rule to cover the orthographic rule applied to the particular form rule, the computer-manipulatable data structure comprising a set of interconnected nodes, the nodes containing information on the form construction rules, their related paradigms and the orthographic rules and associated sets and variables, said computer-manipulated data structure being stored in the memory device;

c. accepting as input at an input device, data identifying a lemma and the name of the desired form with which is associated a set of features, or morpho-syntactic properties;

d. locating in the computer-manipulatable data structure stored in the memory device, the form rule in the paradigm associated with the specified lemma that corresponds to the specified form;

e. generating intermediate forms for the right-hand-side stems for the form rule;

f. locating in the computer-manipulatable data structure stored in the memory device any variants associated with the form rule, the variants comprising form rules representing orthographic variations on the construction of the form rule;

g. locating in the lexicon entry for the given lemma an indicator to determine if the orthographic rule applies;

h. applying the affix of the variant according to the operator, if one or more stem strings for the variant's stem were successfully generated;

i. applying the affix of the (non-variant) form rule according to the operator, unless the variant form rule is applicable; and

j. returning, at an output device, the generated form to a user.

3. A computer implemented method for performing computer-based morphological analysis of words in a natural language, the method utilizing a computer, the computer implemented method comprising the steps of:

a. providing a syntax for a description of the inflectional morphology of a natural language, said description comprising a set of statements made according to the syntax, said syntax stored in a memory device;

b. accepting, at an input device, as input to the computer a set of statements for the description of the inflectional morphology, the set of statements specified according to the syntax;

c. creating a computer-manipulatable data structure, using the set of statements made according to the syntax, the data structure comprising a set of interconnected nodes, the nodes comprising information on the statements of the natural language and the nodes being linked by a hierarchical structure and a plurality of interconnecting references; and

d. performing morphological analysis using the computer-manipulatable data structure, the morphological analysis comprising inflectional operations as words found in a natural language.

4. A morphological text analyzer for inflection operations for manipulating word forms in a natural language,the morphological text analyzer comprising:

a computer processor;

a memory storage device coupled to the computer processor;

a computer-manipulatable data structure stored in the memory storage device comprising a hierarchical tree with interconnected nodes, the nodes containing computer-manipulatable information concerning the inflectional morphology of a natural language, wherein the computer-manipulatable data structure was created from a mapping of a high-level description of the inflectional morphology of a natural language, the high-level language comprising a syntax to provide for the specification of inflectional paradigms, the mapping being a compiling process to transform a set of statements of the high-level language into the computer-manipulatable data structure;

a lexicon stored in the memory storage device comprising a set of word entries each entry comprising inflection information for a word of the natural language; and

means, executed by a computer processor, for performing morphological analysis by accessing and manipulating the data structure and the lexicon stored in the memory storage device.


Description

FIELD OF THE INVENTION

This invention relates generally to the fields of information management and data processing and, more particularly, to the field of natural language processing, applied in information management and data processing systems.

BACKGROUND OF THE INVENTION

There is great interest in applying the principles of morphology to problems in natural language processing in the fields of information management and data processing. Morphology is that part of linguistic study which is concerned with the formation of words in a natural language (such as English, French, German or Russian). It is the study of what people do everyday to form words in speech and writing. Of interest is how words are formed to provide meaning in different grammatical contexts.

People communicate by organizing words into larger structures like phrases, clauses, sentences and paragraphs, according to the accepted grammatical, syntactic and spelling conventions of a given natural language. The grammar of a language, groups words into categories that denote the principal parts of the language (or parts of speech), such as nouns, verbs, adjectives and adverbs. Within each part of speech category, a word can have one of several possible inflections, each marking distinctions in use such as gender, number, tense, person, mood or voice. A word, of course, can be used in more than one part of speech. When a word is used in a sentence, it appears in an inflected form that evokes the general meaning of the word and also marks some of the grammatical distinctions in use. For example, the idea of existence or "being" is conveyed in the different forms of the verb "to be". In these sentences:

I am tall.

She is taller.

We are the tallest people in the room. the different inflicted forms of the verb "to be" convey the idea of "being", tailored to the different subjects (e.g., first-person singular, second-person singular, first-person plural) and a tense (e.g., present tense) of the verb. Thus, a word is represented in a language through the collection of inflections, used to denote meaning as applied in a particular grammatical context. Some languages are more highly inflected than others. In English for example, word order (i.e., syntax) is a dominant factor in determining meaning, and, correspondingly, English words appear in only a relatively few number of inflected forms. In comparison, languages such as German place less importance on word order and, correspondingly, rely more on word form. Words in those languages have many more inflections. Inflectional morphology is the study of the different ways in which word forms are constructed in each language.

Morphology employs terms to formally describe the systems of construction that people use to form words in a natural language. Each construction of a given word is called a form, and a form encountered in a passage of text is called a surface form. The entire set or family of forms for a given word is called a lemma. Each lemma is identified by one form from its family (such as the form commonly found in a dictionary) called a citation form (and sometimes called a canonical form).

The words in any language are formed according to an inflectional morphology, which is defined to be the alteration of a basic word form to express the grammatical distinctions of a word within its part of speech category. Alteration is generally defined to mean the concatenation of one or-more affixes to a primary or base word form called a stem. A stem is a group of characters, which roughly corresponds to the citation form of a lemma and evokes the primary meaning of the word. Affixes are a small, closed class of character strings, which are added to a stem either as a prefix or a suffix. Affixes have a systematic grammatical effect, when combined with a stem, that provides an indication of that word form's part of speech category and inflectional characteristics. The indications that denote part of speech and specific inflectional characteristics are called the morpho-syntactic properties of a word.

When people speak or write in a natural language, they instantly form words for a given grammatical context, because they have learned (usually as children in grammar school) that most words inflect according to a number of grammatical construction and spelling rules. The rules for constructing word inflections are systematic and generally apply to a large number of words within a grammatical category. Commonly available grammar books tend to group the words which inflect by the same rules, such as noun declension and verb conjugation groups, often illustrating the rule with examples of a typical word, such as this exemplary description of the rule for French second conjugation verbs:

Rule:

To write the present tense of French verbs whose infinitive form ends in "ir" (such as the verb reussir), take the stem (reuss) and add the following endings:

    ______________________________________
    1st person singular
                       je reuss.vertline.is
    2nd person singular
                       tu reuss.vertline.is
    3rd person singular
                       il/elle reuss.vertline.it
    1st person plural  nous reuss.vertline.issons
    2nd person plural  vous reuss.vertline.issez
    3rd person plural  ils/elles reuss.vertline.issent
    ______________________________________


See, e.g., Nebelad and Frederich, French Grammar, Monarch Press, (New York, 1971). A grammar book table, such that above, includes the construction of an intermediate form stem from the infinitive form of the verb (reuss), a set of suffixes to add to this intermediate stem, and a set of grammatical features (morpho-syntactic properties) that correspond to the resulting word forms. A system of grammatical construction rules that apply to a number of words, such as the group of French second conjugation verbs, is called an inflectional paradigm. By associating a word with an inflectional paradigm, people can create inflected forms for that word, using its construction rules.

In addition to inflectional paradigms, word formation in any language is also governed by a number of orthographic rules. These spelling rules differ from the inflectional paradigms for grammatical construction, because their application is not based on any grammatical purpose. Grammatical construction rules, such as the paradigm to conjugate the present tense forms of the French verb reussir shown above, aid in forming inflections with distinct morpho-syntactic properties. Orthographic rules, on the other hand, are used to make a word conform to certain phonetic or other conventions, apart from grammatical considerations. Moreover, orthographic rules may apply across different grammatical paradigms.

For example, when adding an "s" to a word ending in "y" in the English language (e.g., "fly" or "army"), there is a spelling rule to follow: change the "y" to "i" and add "es" (e.g., "flies" or "armies"). That rule generally applies in both pluralizing nouns and conjugating verbs. In the French language, an orthographic rule governs the phonetic softening of the letter "c" as it appears in words, at times replacing the "c " with a cedilla "c". If the "c" "n" appears in a French word form and is followed by the letter "a", "o" or "u", a cedilla must be used, as in the noun form le garcon and the present tense first person plural form of the verb nous placons. In such cases, both the inflectional paradigm form construction rule and the orthographic rule govern the formation of the inflected form.

In any language, some word forms are exceptional and do not inflect according to any paradigm. People simply learn these irregular inflections, as for example, the conjugations of the present tense forms of the French language verb etre:

    ______________________________________
    1st person singular  je suis
    2nd person singular  tu es
    3rd person singular  il/elle est
    1st person plural    nous sommes
    2nd person plural    vous etes
    3rd person plural    ils/elles sont
    ______________________________________


In English, everyone knows that nouns are pluralized by adding an "s", but there are many fish, many sheep and many other groups of words to which neither a grammatical paradigm nor an orthographic rule applies.

People who are familiar with a natural language are generally skilled in using the inflectional paradigms and spelling rules (usually learned in grammar school). They are quick to recognize and form words within a grammatical context by manipulating inflectional paradigms and spelling rules and, when necessary, remembering the exception forms. For example, when a person comes upon a word in a book that she does not know, she must manipulate the word form found in the text, and determine a corresponding citation form, in order to look up the word in a dictionary. Even an unabridged dictionary for a given language does not list and define every inflection of every word. Rather, dictionaries generally provide a single entry for each word, identifying each entry with only the citation form. For example, the citation form for English verbs is the infinitive form; for English nouns, it is the singular form. With each citation, the dictionary provides, in addition to the citation form, information on a word, such as its part of speech characteristics, a definition, and some inflectional variants of the word. Thus, a dictionary is a source for all basic forms of a word, but only the basic forms. Morphologists define a lexicon to be a source of basic word forms, such as an unabridged dictionary. With a lexicon and a set of paradigms a person can inflect all of the possible word forms. People can inflect forms for even unfamiliar words, if they know that the unfamiliar word inflects according to a known paradigm.

Moreover, people easily remember and use the grammatical construction and orthographic rules, because the construction rules they have learned capture the linguistic generalities of a natural language. Orthographic rules are not specific to inflectional paradigms, but, rather, apply across paradigm boundaries. In some cases, a given orthographic rule applies only to a subset of all the words which otherwise fit the context specified by the orthographic rule. In such cases, information on whether particular inflections adhere to an orthographic rule must be remembered (or looked up in a dictionary). Additionally, the concept of word formation using inflectional paradigms, based on the taking of a base form of a word (such as a citation form) and creating an intermediate stem form to which affixes are added, applies across all paradigms. Those rules for constructing forms can be reused, as a general matter, to generate forms within the same paradigm, and sometimes in different paradigms. Thus, by reducing the myriad of inflected forms down to a relatively small number of form construction techniques, applicable across different paradigms and orthographic rules, the linguistic generalities inherent in any natural language may be captured in a way that helps people master the language.

In the fields of information management and data processing, applications have been developed to perform morphological analysis on natural language text, attempting to replicate the functions that people do every day. There are many applications where the ability to manipulate word forms while maintaining data on the morpho-syntactic properties as people do is important and useful. For example, morphological analysis has been used in information retrieval systems, where the words of a text file are indexed by the citation form of a word or some partial stem. In those systems, a text file search is performed by deriving a citation form or stem form of the word, and searching for entries under that citation form index. Manipulation of word forms is also important in on-line dictionary and thesaurus systems, where a word form is provided and a dictionary entry is retrieved by generating an associated citation form or other lemma indicator. In other systems, tracking the morpho-syntactic properties of a word is also important. For example, in database systems that permit queries in a natural language (where the meaning of the words in a query is important), morphological text analysis has been used to process them by taking each surface form found and generating a citation form or lemma identification for the word and the corresponding morpho-syntactic properties. Maintenance of morpho-syntactic information is also important in machine translation systems (in which text passages are translated from one natural language to another), where a surface form in one language is found in a passage and translated to another language in a form having the same morpho-syntactic properties.

Generally, a computer-based system for morphological analysis of text begins with a lexicon (i.e., a set of base word forms) associated with a set of form construction rules. The hope is to build a system to do some of the things people do almost automatically: inflectional recognition and inflectional generation. Inflectional analysis is the process of taking a word form as it appears in text (i.e., a surface form), recognizing or deriving the morpho-syntactic properties of that form, and locating its family of word forms (i.e., lemma). Inflectional generation is the process of generating a particular surface form with specific morpho-syntactic properties, given a citation form in a lexicon and a set of construction rules. Additionally, there has been effort to design in an elegant way the system for morphological analysis. If possible, the design of the morphological system should support multiple natural languages. Additionally, the programming or specification of the inflectional language should be easily accomplished. If possible, the specification should follow as closely as possible the easily understood system of using grammatical rules and spelling rules, (found commonly in grammar books) and applying them to basic forms found in a lexicon source (usually the dictionary). That way anyone familiar with a natural language, not exclusively those versed in computer programming techniques, could specify a description of a language's inflectional morphology and build a system. Additionally, if the method for specifying a description of the inflectional morphology and the lexicon of a natural language permitted a compact description, there would be no maintenance problem in adding new words to the lexicon and associating them with appropriate grammatical and spelling rules. However, the currently available systems for morphological analysis present incomplete and non-elegant solutions to the problem of providing morphological text analysis in information management and data processing systems.

In terms of functionality, some of the currently available systems cannot be used for both inflectional recognition and inflectional generation. And in terms of system design, the currently available systems generally use non-intuitive, highly contrived formalisms to describe the inflectional morphology of a given natural language, that often fail to take advantage of the linguistic generalities that are inherent in the language, resulting in systems that map poorly to the morphological specifications found in grammar books. Such systems can have serious practical drawbacks. For example, some of the currently available systems for morphological analysis incorporate a design that does not use paradigms. Without explicit paradigms, some valuable linguistic generalizations are lost and updating the lexicon with additional words may become a burdensome process. Other systems employ paradigms which consist of highly artificial or limited construction rules. These systems have serious practical drawbacks in that they do not incorporate features to take advantage of linguistic generalizations inherent in the language.

As discussed above, people remember paradigms for grammatical construction separate from orthographic rules that sometimes apply across paradigms. Moreover, because many paradigms differ in construction rules for only a few forms, paradigms can be arranged in hierarchies where the form construction rules of a "general" paradigm can be reapplied in other paradigms with certain exceptions. Thus, paradigms could be specified to inherit form rules of other paradigms. The use of features such as the specification of Orthographic rules, separate from the paradigm, and paradigm inheritance of form construction rules, would enable a system to be created that would map well to grammar books, and would permit compact and easy specification and updating by anyone familiar with the language. However, to date the presently available systems have yet to provide an elegant solution. For example, even in those currently available systems where orthographic rules are separated from form construction rules, the user must either explicitly specify links to the orthographical rules or transform each lexical entry into some equivalent, but highly artificial, lexical representation to provide a link to an orthographic or spelling rule--both non-trivial burdens on users who maintain the system and populate it with new words.

Moreover, some of the design choices used in presently available systems prevent them from being used with multiple languages. In some systems, an approach is taken for a specific language like English (which is not highly inflected) that is useless in creating a system for languages like Russian or German (which are both highly inflected). In other systems, the syntactic and grammatical features are hard-coded so that the system must be reprogrammed using a computer programming language each time that system is argumented to process words in a different natural language.

Thus, there is a need for a system to perform morphological text analysis which provides the functionality of both inflectional recognition and inflectional generation and also incorporates design features that would take advantage of the linguistic generalities found in many languages. The creation of such a system would be a useful advance to the field of information management and data processing.

SUMMARY OF THE INVENTION

The present invention provides a method and apparatus to perform morphological text analysis functions, including inflectional generation and inflectional recognition, by incorporating a design that supports multiple natural languages and enables users to specify the language description in compact terms that map well to grammar book descriptions. The invention comprises a number of components, including an M language syntax, used in specifying a description of the inflectional morphology of a natural language, a compiler module, used by a computer processor to transform the user-specified language description into a data structure that is used by the computer to perform the functions of morphological analysis, and a pair of generation and recognition modules, used to perform the functions of morphological analysis, using the compiled data structures and lexicon of the natural language. The present invention can be arranged to perform the functions of morphological analysis as part of a larger information management or data processing system such as, for example, in a text retrieval system or machine translation application.

In a rule specification phase, the present invention provides the user with the ability to specify a description of a natural language, using the syntax of the M language. The syntax of the M language takes advantage of the linguistic generalities inherent in a natural language, permitting the user to express rule-sharing relationships between inflectional paradigms, express generalizations over word forms, as well as share separately declared orthographic rules between paradigms. The rule statements can be specified by the user in a compact way that maps well to the language descriptions commonly found in grammar books. With the present invention, maintaining the lexicon and the language description is not a burdensome process, and can be performed easily by one who is acquainted with language. In a compilation phase, the compiler module, comprising executable program statements and data structures, enables a computer processor to read the compact, textbook-like language description and produce from it a data structure that can be used for computer-based morphological processing. In a process that is invisible to the user, the compiler module integrates the language constructs by arranging the form rules of each paradigm into a hierarchical transition network through the processes of "disambiguation," and "form rule inheritance" which expands form sets into equivalent sets of rules, making entry points into the corresponding lexicon from lexical stem references in the form rules and "conflating" possible spelling form variations into the form rules after matching the form rules to the orthographic rules based on operator affix and context. The process of "conflating" involves the automatic creation of additional variant form rules to map the application of a spelling rule to a given form construction rule. Additionally, the compiler module enables the computer processor to construct a discrimination net which is used in the process of morphological recognition.

The functions of inflectional recognition and generation are performed by the recognition and generation modules of the present invention: each is comprised to use the data structure of the described rules and the lexicon. All aspects of the present invention will be described in detail below.

BRIEF DESCRIPTION OF THE FIGURES AND APPENDICES

FIG. 1 Depicts an exemplary embodiment of the data structures and computer program modules of the present invention as used in conjunction with a text retrieval system;

FIG. 2A Depicts an overview of an exemplary process flow for the compiler module of the present invention;

FIG. 2B Depicts an exemplary process flow for the right-hand-side "disambiguation" and chaining functions of the compiler module of the present invention;

FIG. 2C Depicts an exemplary process flow for a form rule inheritance feature of the compiler module of the present invention;

FIG. 2D Depicts an exemplary process flow for a form rule inheritance processing feature of the compiler module of the present invention;

FIG. 2E Depicts an exemplary process flow for a form rule divergence check feature of the compiler module of the present invention;

FIG. 2F Depicts an exemplary process flow of a cycle check and feature propagation function of the compiler module of the present invention;

FIG. 2G Depicts an exemplary process flow of an orthographic rule "conflation" function of the compiler module of the present invention;

FIG. 2H Depicts an exemplary process flow of a discrimination net building function of the compiler module of the present invention;

FIG. 2I Depicts an exemplary process for the addition of an affix string into a discrimination net according to the present invention;

FIG. 2J Depicts an exemplary process flow of a discrimination net terminal node addition function of the compiler module of the present invention;

FIG. 2K Depicts an exemplary process flow for a procedure for the addition of form rule variant into a discrimination net according to the method of the present invention;

FIG. 2L Depicts an exemplary process plan for a process to add a string element of an affix sequence variable into a discrimination net according to the method of the present invention.

FIG. 3A Depicts an exemplary overall process flow of a morphological recognizer module of the present invention;

FIG. 3B Depicts an exemplary process flow for the discrimination net processing function of the morphological recognizer module of the present invention;

FIG. 3C Depicts an exemplary process flow for a stem rule application function of the morphological recognizer module of the present invention;

FIG. 3D Depicts an exemplary process flow for a form rule application function of the morphological recognizer module of the present invention;

FIG. 3E Depicts an exemplary process flow of an affix application function of the morphological recognizer module of the present invention;

FIG. 3F Depicts an exemplary process flow of the affix variable application function of the morphological recognizer module of the present invention;

FIG. 3G Depicts an exemplary process flow for an affix sequence application procedure of the morphological recognizer module of the present invention;

FIG. 3H Depicts an exemplary process flow for an affix sequence matching procedure of the morphological recognizer module of the present invention;

FIG. 3I Depicts an exemplary process flow for an affix sequence variable matching procedure of the morphological recognizer module of the present invention.

FIG. 4A Depicts an exemplary overview process flow of the form rule generation process of the morphological generator module of the present invention;

FIG. 4B Depicts an exemplary process flow of the procedure for stem string generation of the morphological generator module of the present invention;

FIG. 4C Depicts an exemplary process flow for the process of variant form rule string generation of the morphological generator module of the present invention;

FIG. 4D Depicts an exemplary process flow for the process of affix application to stem strings of the morphological generator module of the present invention;

FIG. 4E Depicts an exemplary process flow for the process of affix string generation of the morphological generator module of the present invention;

FIG. 4F Depicts an exemplary process flow for the process of surface string generation of the morphological generator module of the present invention;

FIG. 4G Depicts an exemplary process flow for the process of affix sequence generation of the morphological generator module of the present invention; and

APPENDIX I Lists an exemplary formal syntax for the M language of the present invention;

APPENDIX II Lists an exemplary set of user prepared declarative rules for the German language;

APPENDIX III Lists an exemplary set of user prepared declarative rules for the French language;

APPENDIX IV Lists an exemplary algorithm for the compilation module of the present invention;

APPENDIX V Lists an exemplary algorithm for the morphological recognizer module of the present invention; and

APPENDIX VI Lists an exemplary algorithm for the morphological generator module of the present invention.

DETAILED DESCRIPTION

FIG. 1 depicts an exemplary embodiment of the data structures and computer program modules of the present invention. A computer system is provided comprising a computer 2 coupled to an input device 4, such as a keyboard, and a display device, such as a screen monitor 5. The computer 2 comprises a processor 6 that is coupled to the input device 4, the screen monitor 5, and two storage devices, a main computer memory 8 and a secondary storage device 10. The processor 6 is configured to accept input from the input device 4, store input in either computer memory (8 or 10), and create and execute morphological analysis programming, using the data structures and program modules provided by the present invention. Although the present invention can be implemented on any computer system, in an exemplary embodiment of the present invention, a VAX 6000-410, manufactured by the Digital Equipment Corporation, configured to process using the VMS operating system, also manufactured by the Digital Equipment Corporation, provides the platform for the morphological text analyzer of the present invention. For more information on the 6004-010 and the VMS operating system, the reader is referred to the following publications: VMS User's Manual, Order Number: AA-6A 98B-TE, VMS Media and Extended Documentation Kit, Order Number: QA-001AA-H5, Digital Equipment Corp., Maynard, Mass.

Although the data structures and computer program modules of the present invention will be described in detail below, a brief overview of the system can be described as illustrated by FIG. 1. The present invention provides three operation phases: a rule specification phase, a compilation phase, and a run-time phase. In the rule specification phase, a user 14 employs the syntax of an M language 12 to specify, using the input device 4, the description of the inflectional morphology of a natural language, such as German or French. Additionally, the user stores in a database the word information entries that comprise a lexicon 22 that will be used in conjunction with the description of the inflectional morphology. The features of the M language 12 and the lexicon 22 are described in detail below. As input comprising statements of the M language 12 is accepted by the processor 6, it is first stored as a set of user prepared declarative rules 16 in the main computer memory 8 (while the user completes or edits his or her specification) and then stored in the secondary storage area 10 (until the compilation phase). The word forms and accompanying information contained in the lexicon 22 are also stored in the main computer memory 8 (during the specification phase) and then moved to secondary storage 22 (until the run-time phase).

In the compilation phase, a compiler module 18, comprising executable program statements and data structures, is used by the processor 6 to read the set of user prepared declarative rules 16 and produce from it a data structure comprising a compiled rule set 20. This data structure is used by the processor 6 in the run-time phase to perform the functions of morphological generation and recognition.

The result of the compilation process is the creation of a computer-manipulatable data structure that can be used, with the lexicon 22 to perform morphological processing. The data structure includes chains of form rules indexed by paradigms. As the form rules of one paradigm can be inherited by another, the present invention provides that associated with each form rule is the list of paradigms that "share" the form rule. Also associated with each form rule is a list of the morpho-syntactic properties of the form. Further, should an orthographical rule apply to a particular form rule, the present invention provides that "variant" form rules be associated with each form rule in the data structure, to enable the application of the orthographic rules in specific circumstances. The distinctive aspects of the compiler module 18 and the compiled rule set 20 are described more fully below.

In the run-time phase, the executable program statements comprising a morphological generator module 24 are used by the processor 6 to perform the function of morphological word form generation. As stated above, morphological generation involves the process of generating word forms, given a lexicon and a compile, d rule set to guide the generation of word forms. The computer program statements of the morphological recognizer module 26 are used by the processor 6 to perform the functions of morphological recognition. As stated above, morphological recognition entails the function of identifying one or more instances of a set comprising the word form, a paradigm identifier, a set of morphosyntactic properties, and a corresponding lemma (i.e., lexical entry) for a given surface form. The statements of the morphological recognizer module 26 enable the processor 6 to identify the lemma and the morpho-syntactic properties indicated by the surface form.

As stated above, the method and apparatus for morphological text analysis of the present invention has wide application in devices for information management and data processing fields, such as in machine translation systems, natural language processing systems, as well as on-line dictionary, grammar checking, and thesaurus systems. In an exemplary embodiment, FIG. 1 depicts the data structures and computer program modules of the present invention, used in conjunction with a run-time text retrieval module 28. In the exemplary embodiment, the processor 6 is connected to the run-time text retrieval module 28 and a plurality of database text files 29, comprising natural language text. Each surface form in the text is indexed according to its citation form. In such a system, the processor 6 prompts the user for a query, using the executable program statements comprising the text retrieval system and generates calls to either the morphological generator module 24 or the morphological recognizer module 26. For additional information on aspects of an indexed database retrieval system, the reader is referred to pending U.S. application, Ser. no. 07/723,229, issued as U.S. Pat. No. 5,251,316 entitled "Method and Apparatus for Integrating a Dynamic Lexicon into a Full-Text Information Retrieval System", filed on Jun. 28, 1991 and U.S. application, Ser. No. 07/472,245, entitled "A Direct Manipulation Interface for Boolean Information Retrieval", filed on Jan. 30, 1990, now U.S. Pat. No. 5,175,814 which are expressly incorporated herein by reference. The description now to follow will describe in detail ,the features of the present invention as outlined above.

Morphological Rule Specification

The M language 12, as stated above, enables the user 14 to specify a set of declarative rules 16 that describe the inflectional morphology of a given natural language. To write the description, the user 14 is provided with a syntax that will dictate the form and sequences of a set of declarative statements to describe the inflectional morphology of a natural language. APPENDIX I lists the formal syntax for the M language 12. This syntax enables a user to capture the linguistic generalities inherent in natural languages, because it permits: 1) the specification of a hierarchy of paradigms, where each sub- paradigm inherits the features and form construction rules of its parent; 2) the separate specification of orthographic rules, which include sets and variables, that apply across inflectional paradigms; 3) the use of intermediate and surface form rules; 4) the specification of features which describe the morpho-syntactic properties associated with a form construction; 5) the ability to override intermediate form rules (as well as surface forms) in certain exceptional cases with entries in the lexicon; 6) the ability to declare affix variables (which will be described below); and 7) the ability to use form sets (also described below) in presenting a declaration. With the M language 12, the user 14 can specify a compact morphological description that maps well to a description found in grammar books, and thus mastery of the M language 12 syntax does not require a degree in computer science, only mastering of the basic grammatical and orthographical rules of a natural language. The user 14 who builds the description can be any person well-versed in natural language, such as a professor or a writer, APPENDICES II and III list partial descriptions of a set of user declared rules 16 for the German and French languages, respectively. Various aspects of the M language 12 syntax will now be discussed with reference to examples appearing in those appendices.

The user 14 specifies a set of prepared declarative rules 16 that provide a morphological description of a natural language, using the set of M language 12 statements, organized into a set of input data called a rule set. A rule set includes a declaration of a rule set name and a set of elements including paradigms and orthographic rules as follows. An M language 12 paradigm declaration has the following exemplary form:

    ______________________________________
    paradigm: <name> {
    based.sub.-- on: <parent-paradigm>
    use.sub.-- for.sub.-- instances: <true/false>
    acquisition.sub.-- form: <form-name>
    exemplar: <example>
    form.sub.-- sets {
    <form-st-name> = (<form-list>)
    <form-set-name> = (<form-list>)
    affix.sub.-- vars {
    <affix-var-name> = (<affix-list>)
    <affix-var-name> = (<affix-list>)
    }
    intermediate.sub.-- forms {
    <form-name>: <form-body> [<feat-spec>]
    /<qualifiers>
    <form-name>: <form-body> [<feat-spec>]
    /<qualifiers>
    }
    surface.sub.-- forms {
    <form-name>: <form-body> [<feat-spec>]
    /<qualifiers>
    <form-name>: <form-body> [<feat-spec>]
    /<qualifiers>
    }
    form.sub.-- features{
    <form-name> [<feat-spec>]
    <form-name> [<feat-spec>]
    }
    }
    ______________________________________


Each paradigm is identified by a name and contains a number of sub-declarations.

The basic components of the paradigm declaration are the form construction rules. The present invention provides for surface form rules (which describe the construction of forms that appear in text) and intermediate forms rules (which describe the construction of forms which may never appear in text, but which are useful, because they are basic forms used as steps in the construction of surface forms. In addition, the present invention provides for the declaration of a lexical form, which is stored directly in the lexicon, but which is referenced in the paradigmatic form construction rules by the "LEX" identifier, as will be described below.

A form rule in the M language 12 declaration comprises a left-hand-side (LHS) <form-name> which can be any indicator of a particular inflection, such as "pres-1s," or a simple identifier such as "X", and a number of right-hand-side (RHS) components. The right-hand-side <form-body> may assume one of several formats, as follows:

    ______________________________________
            <form-op> <affix> <stem>
                  or
            <stem> <form-op> <affix>
                  or
            <stem>
                  or
            LEX
                  or
            NIL
    ______________________________________


Where <form-op > is either a plus (+) or minus (-), indicating whether the affix is being added or removed; <affix> is either a string (enclosed in quotes) or the name of an affix.sub.-- var (to be described below); and <stem > is the name of a form.sub.-- set or form that resolves in one or more steps into a stem in the lexicon (i.e., a form that was generated from a form the <form-body> of which comprises LEX). NIL, used only in paradigms that inherit forms from a parent paradigm, indicates that the given form does not exist in the current paradigm. An example use would be for the French verb pleuvoir, which exists only in the third person singular. Its other forms would have a NIL value. Thus, the form construction rules of both surface and intermediate forms distinguish between two major categories of strings: stems, which are any forms which include the primary lexical base of the word, and affixes, which comprise the prefixes and suffixes which can be appended to a stem in the process of word formation. Once an affix is appended to a stem, however, the result is also a stem, since the result also includes the primary lexical base.

Examples of surface and intermediate form rule declarations for the French language in APPENDIX III are found in the paradigm named "verb.sub.-- root":

    ______________________________________
    Intermediate.sub.-- forms{
            BASE: Inf - "IR"
    Surface.sub.-- forms {
            inf : LEX
            pres.sub.-- 1s: BASE + "E"
    . . .
    ______________________________________


Notice that the surface form rule (pres.sub.-- 1s) is formed by taking a stem (BASE) and concatenating it with the affix ("E"). The present invention provides for the chaining of form rules: the stem (BASE) in the form rule is the identifier of an intermediate form rule (BASE), which is formed by taking a stem (inf) and stripping the affix ("ir") from it. The stem (inf) also identifies a surface form rule (inf). The right-hand-side (RHS) of the infinitive form is a (LEX) indicator which indicates that the form (inf) is constructed by accessing the lexicon 22 (FIG. 1) and locating a citation form. Thus, in the construction of the form (pres-1s) there is a chain (pres-1s) to (BASE) to (inf) to (LEX). This chaining of form rules from right-hand-side stem to left-hand-side form name permits a compact description of the language that maps well to grammar book descriptions.

Associated with each surface form is a set of one or more features that provide a description of the morpho-syntactic properties of that surface form. The M language 12 permits the specification of features with the <feat-spec> declaration as follows:

<feat>=<value>

The <feat-spec > declaration comprises a number of feature-value pair specifications, where <feat > is an identifier of the feature and the <value > is the value for the feature. An example of a set of feature value pairs for the form rule (pres.sub.-- 1s) appears in the paradigm (verb.sub.-- root) in the French language user prepared declarative rules 16 listed in APPENDIX III:

    ______________________________________
    form.sub.-- features{
    pres.sub.-- 1s: temps=present, personne=1, nombre=singular
    . . .
    ______________________________________


The example contains a set of three form/feature pairs that are associated with the form rule (pres.sub.-- 1s). Each feature identifier (temps), (personne) and (nombre) describes the category of morpho-syntactic properties used by the associated form. The value declarations (e.g., present) provide specific morpho-syntactic properties for the form within the feature category. The feature/value pairs are associated with a form construction rule through the shared use of the same left-hand-side identifier (e.g., pres .sub.-- 1s).

As can be seen from the example above, the M language 12 syntax provides for form-features to be declared apart from the form rule in a form-feature declaration. To link the form-feature to a form rule, in such a case, form/feature pairs can be declared, as in the example using the form name with which it is to be associated. However, it is also possible that the form features be declared in the form rule itself as in the example found in the user prepared declarative rules 16 for the German language in APPENDIX II. The paradigm verb.sub.-- strong.sub.-- pref has the following declaration:

    ______________________________________
    surface.sub.-- forms{
    finite.sub.-- unattached: FINITE [unattached=true]
    ______________________________________


where unattached=true is the feature-value pair. The two declaration areas in the paradigm are provided merely for convenience, as sometimes it is more convenient (and more readable) to list the sets of feature-value pairs associated with a form in a separate section when there are many feature-value pairs. The format of the <feat-spec> is identical, regardless of whether it is placed within the form rule or in the form.sub.-- features section.

Additionally, multiple sets of feature-value pairs may be used for those cases where there is a single surface form for multiple "paradigm locations," e.g., in English we might have the following forms:

    ______________________________________
    inf: LEX
    pres.sub.-- 3s: inf +
              [tense = present, person = 3, number =
    "s"       singular]
    pres: inf [tense = present, person = 1, number =
              singular;
             tense = present, person = 2, number =
            singular;
             tense - present, person = 1, number = plural;
             tense - present, person = 2, number = plural;
             tense - present, person = 3, number = plural;]
    ______________________________________


where the infinitive form (inf) is used in the right-hand-side of many different form rule constructions. In the exemplary embodiment of the present invention, form features used in a given rule set are specifically declared and a range of values must be assigned to each form-feature. Thus, in the paradigm, <feat> must have been previously declared in a separate feature values section and <value > must be one of the previously declared values for the given feature. Feature values are declared separately from paradigms as follows:

    ______________________________________
            feature.sub.-- values {
               <feature> = (<value-list>)
               <feature> = (<value-list>)
             }
    ______________________________________


Where <value-list> represents a value followed by zero or more occurrences of a comma followed by another value. One or more features may be declared in this manner. An example in the English language is: tense=(present, past, future). In APPENDIX II there appears the following feature value set declaration from the user prepared declarative rules 16 of the German language:

    ______________________________________
    feature.sub.-- values{
    tense = (present.sub.-- indicative, imperfect, present.sub.-- subjunctive,
    special.sub.-- subjunctive, past.sub.-- participle)
    person = (1, 2, 3)
    number = (singular, plural)
    non.sub.-- finite = (true, false)
    unattached = (true, false)
    ______________________________________


The left-hand-side of each declaration identifies the category of morpho-syntactic property with which a word form can be identified (e.g., tense) The right-hand-side provides a set of possible values over which the feature identifier may range (e.g., the person category can be either first person, second person or third person.

Referring again to the declarations of surface and intermediate form rules, the <form.sub.-- body> and <feature-spec> declarations, a RHS argument of a form rule may in addition, contain one or more qualifiers. Qualifiers are used at run-time to indicate that a given lemma may override the given (intermediate) form rule with a string stored in the lexicon. The present invention provides an exemplary override statement:

allow.sub.-- lexical.sub.-- override

This is used to indicate that the given form may be overridden by a stem in the lexicon; in the exemplary embodiment, it is used only for intermediate forms, as any surface form may be overridden by a lexical entry. For example, in APPENDIX III there appear the following forms in the paradigm verb.sub.-- root:

    ______________________________________
    Intermediate.sub.-- forms{
    . . .
    FUT: inf          /allow.sub.-- lexical.sub.-- override
    surface.sub.-- forms{
    inf: LEX
    ______________________________________


In constructing the form (FUT) for a given lemma, the syntax of the M language 12 dictates that the lexicon 22 first be searched to determine if there is a (FUT) entry for that lemma before the form rule (inf) is accessed to construct the form. If there is a form FUT in the lexicon 22 for a given lemma, that form will be used instead of following the inf construction chain, thus, allowing the lexicon form in that case to "override" the normal construction rule. If no FUT form appears in the lexicon 22 for the given lemma, the processing will continue using the form rule (inf).

As mentioned above, the present invention provides for a hierarchical organization of paradigms, whereby one paradigm may inherit some or all of the form construction rules of another paradigm. The "based-on" declaration (shown in the general paradigm declaration above) indicates the name of the paradigm from which the given paradigm inherits its rules. If the current paradigm is not based on any other paradigm, then that line is omitted. In the compilation phase, the processor 6 (FIG. 1) uses the compiler module 20 (FIG. 1) to process the statement based-on as a signal to incorporate form rules from the parent paradigm into the current paradigm according to the method of the present invention.

It is the left-hand-side of the form rule, i.e. the form-name, which governs which rules are inherited from a parent paradigm. Only those rules with form names that do not exist in the child paradigm are inherited from the parent. One can think of the rules in the child paradigm as overriding rules with the same form name from the parent paradigm. The ability to inherit and override not only surface forms but also intermediate forms (without requiring the restatement in the child paradigm of rules that use an overridden intermediate form but otherwise behave just like the parent rules) can greatly simplify and condense the rule set that a user need develop to describe a natural language such as French or German.

In addition, the present invention provides in a paradigm declaration two additional features for a compact description of an inflectional morphology: form sets and affix variable sets. Form sets represent a short-hand notation that sometimes can be used to declare a set of similar forms. Within a paradigm declaration, the M language 12 syntax provides that the form sets be declared as follows:

    ______________________________________
    form.sub.-- sets {
           <form-set-name> = (<form-list>)
           <form-set-name> = (<form-list>)
    ______________________________________


where the <form-set-name> identifies the form set specified on the right-hand-side. The <form-list> is a list of one or more of the form-names declared in the following sections or inherited from the parent paradigm, each separated by a comma. A typical declaration appears in the paradigm provided in the user prepared declarative rules 16 for the German language in APPENDIX II.

    ______________________________________
    form.sub.-- sets {
    FINITE=(pres.sub.-- 1s,pres.sub.-- 2s,pres.sub.-- 3s,pres.sub.-- 1p,pres
    2P,pres.sub.-- 3p,
    imp.sub.-- 1s,imp.sub.-- 2s,imp.sub.-- 3s,imp.sub.-- 1p,imp.sub.--
    2p,imp.sub.-- 3p)
    ______________________________________


There are cases in which a'set of forms (instantiated by different strings and referring to different values of the same set of features) exhibit the same behavior with respect to form construction rules that use them to create new forms. In such cases, it may be useful to define a form set name to serve in form construction rules as a placeholder for any one of a set forms. The present invention allows one to define such a form set and use it wherever a stem can appear in the right-hand-side of a form rule, such as, for example, the declaration using the form set FINITE in the German language paradigm verb.sub.-- strong.sub.-- pref of APPENDIX II:

    ______________________________________
    surface-forms {
    finite-unattached: FINITE [unattached = true]
    ______________________________________


In the compilation phase, the processor 6 will take each form rule accessing a form set and "conflate" it with the members of the form set to create a new set of form rules, each having as features the union of features associated with the original form rule and the form set member. For example, the form finite-unattached.pres.sub.-- 1s, as described above, has the associated feature/value pair, personne=1, whereas the finite-unattached.pres.sub.-- 25 has associated with it the feature value pair personne=2. Representationally, the conflating process would yield:

    ______________________________________
    surface-forms {
    finite-unattached . pres.sub.-- 1s = pres.sub.-- 1s [unattached = true]
    finite-unattached . pres.sub.-- 2s = pres.sub.-- 2s [unattached = true]
    finite-unattached . pres.sub.-- 3s = pres.sub.-- 3s [unattached = true]
    finite-unattached . pres.sub.-- 1p = pres.sub.-- 1p [unattached = true]
    finite-unattached . pres.sub.-- 2p = pres.sub.-- 2p [unattached = true]
    finite-unattached . pres.sub.-- 3p = pres.sub.-- 3p [unattached = true]
    finite-unattached . imp.sub.-- 1s = imp.sub.-- 1s [unattached = true]
    finite-unattached . imp.sub.-- 2s = imp.sub.-- 2s [unattached = true]
    finite-unattached . imp.sub.-- 3s = imp.sub.-- 3s [unattached = true]
    finite-unattached . imp.sub.-- 1p = imp.sub.-- 1p [unattached = true]
    finite-unattached . imp.sub.-- 2p = imp.sub.-- 2p [unattached = true]
    finite-unattached . imp.sub.-- 3p = imp.sub.-- 3p [unattached = true]
    . . .
    ______________________________________


A form whose derivation includes no form sets will be referred to as a "ground" form, since it always corresponds to a precise string with associated set (or sets) of feature values. A form constructed using a form set is known as a "variable form", since different instantiations of the form may correspond to different strings with different feature-value sets. Whenever a form set, or a variable form, is used in the right-hand-side of a form construction rule, the resulting left-hand-side form must be a variable form.

Form sets are a useful way of generalizing behaviors, since rules can be written which act on any of a set of forms, rather than requiring a separate rule to be written for each member of the set. However, it can also be useful to refer by name to the "ground" forms implied by a variable form. For this reason, the present invention permits a naming convention by which ground form names can be generated. It works as follows: the first time a form set is used to construct a variable form, the ground form names are the concatenation of the strings <variable form name>"."<form set member name>.

Affix variables (affix.sub.-- vars) are similar to form sets in that the variable represents a part of the form that varies in a generalizable way. Declarations of affix variables are permitted in the M language 12 to handle special cases when there are a small, fixed class of affixes whose value must unify with a value stored in the lexicon, such as for words with separable prefixes in German. However, in contrast to the form set whose scope is the part of the form that contains the lexical "stem", the affix variable ranges over a set of affixes. In APPENDIX II, an example of a declaration appears in the verb.sub.-- strong.sub.-- pref paradigm of the user prepared declarative rules for the German language:

    ______________________________________
    affix.sub.-- vars {
    sep.sub.-- prefix= (ab,an,auf,aus,ein,fort,heim,her,hin,mit,nach,
    nieder,vor,weg,zu,zuruck,zusammen)
    ______________________________________


As used in a form rule, the affix variable can be used where an affix might appear in the right-hand-side, as for example in the German language paradigm verb.sub.-- strong.sub.-- pref of APPENDIX II where the surface form past.sub.-- part is specified:

    ______________________________________
    surface.sub.-- forms {
           past.sub.-- part: + sep.sub.-- prefix p.sub.-- part
    ______________________________________


The affix variable sep.sub.-- prefix specifies a range of values that can be used in constructing the form past.sub.-- part. The affix.sub.-- var name (e.g., sep.sub.-- prefix) is treated as a feature which can be turned on or off using a lexicon reference, while the <affix-list>, which represents a list of all the possible affix strings, is treated as the range of all affix values. A feature/value pair consisting of an affix variable name and value may be used as a constraint by requiring it to unify with equivalent information stored in a lexical entry.

Referring again to the paradigm declaration above, the paradigm declaration also includes a use.sub.--`for.sub.-- instances field, which takes the value of "true" (or "yes") or "false" (or "no"). If the value is false or no, then the paradigm is used solely as a source for inheriting rules, i.e., there are no lemmas in the lexicon that are associated with the paradigm.

In addition to the declaration of paradigms, the present invention provides separate declaration syntax for orthographic rules. Orthographic rules are used to specify spelling changes that occur across paradigms and that are dependent on the orthographic or phonetic context, such as a silent "e" or accented final syllable, rather than on the specific grammatical construction patterns of a paradigm. Orthographic rules are declared as follows:

    ______________________________________
    ortho rules {
    -> <result> (<parameter>) <lhs>
    ______________________________________


Where <rule-type> is either SUFFIX or PREFIX; <parameter> is the name of an orthographic feature, which may be prefixed by a tilde () to indicate negation; <lhs> is the left-hand-side of the rule; and <result> is the construction result that will occur when the rule applies. The form of the left-hand-side <lhs> of the orthographic rule is dependent on <rule-type>. In the case of a SUFFIX, the form of the rule is as follows:

[<context>]<rule-op>[<affix>]

Where <rule-op> is either a plus (+) or minus (-), to indicate whether the suffix is being added or removed; <affix> describes the orthographic constraints on any suffix satisfying the rule; and <context> describes the orthographic constraints on any stem satisfying the rule. The <affix>, <context> and <result> components each comprise a sequence (called an affix variable sequence) comprising any combination of literal strings (enclosed in quotes) and/or any variables declared in a VARS section, as is described below. The context and result sequences then become affix sequences in the form rule variants created during rule conflation. It is assumed that strings which match the <context>, <affix> and <result> components may extend beyond the specified segments, except in those cases of an affix literal value (i.e., a character or string of characters) that ends or begins in a pound sign (#), which indicates end of word for a suffix, or beginning of word for a prefix. For example, +["e"] [<context>] is interpreted as the addition of any prefix that ends with the letter e, whereas +["#ge"] [<context>] is interpreted as the addition of the prefix "ge."

As described above, the <context>, <affix> and <result> components in an orthographic rule declaration can contain character strings and/or variables. The M language 12 syntax provides that all variables used in the orthographic rules be previously declared in a separate declaration as follows:

    ______________________________________
           vars {
             <set-name> <var-list>
             <set-name> <var-list>
           }
    ______________________________________


Where <set-name> must be the name of a previously declared set (to be described below) and <var-list> represents the name of a variable followed by zero or more occurrences of a comma followed by the name of another variable. One or more var-lists may be declared in this manner. An example is:

VOWEL vow1, a.sub.-- or.sub.-- o, vow2.

where VOWEL identifies sets of characters or strings which must also be separately declared as follows. As used in the orthographic rule as an element in a <context>, <affix> or <result> sequence, the variables (vow1), (a.sub.-- or.sub.-- o), and (vow2) would be references to specific instances of members of the set VOWEL, i.e., variables which may be bound at run-time to one value of the VOWEL set. In the French language, for example, the user prepared declarative rules 16 in APPENDIX III, contain the following variable declarations:

    ______________________________________
    vars {
           VOW         vowel
           GEM.sub.-- CONS
                       gcons
           LETTER      char
           CONS        const
           E.sub.-- I  e.sub.-- or.sub.-- i
           NON.sub.-- E.sub.-- I
                       a.sub.-- o.sub.-- u
           NON.sub.-- E.sub.-- VOWEL
                       non.sub.-- e
           A.sub.-- O  a.sub.-- or.sub.-- o
           T.sub.-- CONS
                       c.sub.-- d.sub.-- t
    ______________________________________


As mentioned above the set names in the variable declaration are the identifiers of previously declared sets comprising characters and/or strings of characters. A set is specified in a declaration separate from the variable declaration and the orthographic rule declaration. Sets for use within orthographic rules are declared as follows:

    ______________________________________
    sets {
           <set-name> = (<member-list>)
           <set-name> = (<member-list>)
    ______________________________________


where <member-list> represents a letter followed by zero or more occurrences of a comma followed by a letter. One or more sets may be declared in this manner. An example is: VOWEL=(A,E,I,O,U). In the user prepared declarative rules 16 for the French language, in APPENDIX III, the following sets are declared:

    ______________________________________
    sets {
     LETTER = (A,B,C,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,
            T,U,V,W,X,Y,Z,A,E,E,E,I,O,U)
     CONS = (B,C,C,D,F,G,H,J,K,L,M,N,P,Q,S,T,V,W,X,Z)
     VOW = (A,E,I,O,U,A,E,E,E,I,O,U)
     GEM.sub.-- CONS = (L,N,S,T)
     E.sub.-- I = (E,E,E,I)
     NON.sub.-- E.sub.-- I = (A,O,U)
     NON.sub.-- E.sub.-- VOWEL = (A,A,I,O)
     A.sub.-- O = (A,O)
     T.sub.-- CONS = (C,D,T)
    ______________________________________


The set declaration provides a range of values over which the variables used in orthographic rules may range.

Referring again to the declarative form for an orthographic rule, if a given variable appears more than once within the orthographic rule, then the same value must be substituted for each occurrence. For example, the rule: PREFIX (GEM)+[vow1] [cons vow2].fwdarw.[vow1 cons cons vow2] means that when any prefix that ends in a vowel is added to a stem which has a positive value for the GEM parameter and begins with a consonant followed by a vowel, the consonant is doubled. Note also the use of vow1 and vow2 within the rule. The use of different variable names for the set of vowels indicates that they are independent of each other, i.e., the values need not be identical; although they may happen to be identical.

In the user prepared declarative rules 16 for the French language in APPENDIX III, there are many examples of suffix rules which use variables as affix and context components including:

    ______________________________________
    ortho.sub.-- rules {
     ! geminate the consonant preceding a mute-e syllable
    -> [vowel gconsvowel gcons] + ["E#"]
     gcons "E#"]
    . . .
    ______________________________________


This orthographic rule provides for geminating certain consonants that precede a mute e where the word ends in the affix "e" and is preceded by a vowel and one member from the set of geminating consonants (GEM.sub.-- CONS). The rule specifies that the geminating consonant is doubled where the context and affix of the left-hand-side occur. The variable (vowel) can range over the entire set of characters in the VOW set, and the variable (gcons) ranges over the values L,N,S,T in the set (GEM.sub.-- CONS).

If the orthographic rule applies to a PREFIX, the rule takes the form as follows:

<rule-op>[<affix>] [<context>]

where everything is as defined in the SUFFIX rule except that the context and affix components are reversed, as is the pound sign that indicates the end of the word.

An example of a prefix orthographic rule is found in the German language description in APPENDIX II:

    ______________________________________
    ! ensure that participle has no more than one unstressed prefix and
    ! that weak verbs ending in -ieren do take ge- prefix, eg,
    ! bestelien --> besteilt and studieren --> studiert
    -> [char]O.sub.-- GE) + ["GE"] [char]
    ______________________________________


The left-hand side of the rule comprises the rule operator (+), the prefix ["GE"] and the context [char]. In the context, char is a variable name used to specify a set of characters, or character strings, that would serve as possible contexts for the orthographic rule. The variable (char) would be declared in a separate declaration and associated with a set of range values.

The rule states that if the prefix "GE", is to be added to a word beginning with any letter and the word has a positive value for the NO.sub.-- GE feature, then the GE is dropped. If this orthographic rule applied to every form rule, then the GE prefix would never be applied. Notice that this orthographic rule only applies to certain words, as limited by the use of the parameter NO.sub.-- GE. The present invention provides that application of the spelling rules may be controlled by orthographic parameter features associated with lemma entries in the lexicon 22. Where the lexicon entry contains information showing that the parameter of the orthographic rule applies, only then will the spelling rule apply. For example, a lexicon entry for the German word studieren might contain the following information:

citation: studieren

paradigm: verb.sub.-- weak

stems: inf=studieren

orth.sub.-- features: NO.sub.-- GE=yes

The orth.sub.-- features entry contains the parameter NO.sub.-- GE and an indicator stating that the orthographic rule should apply in forming inflections of studieren.

The description of the M language 12 syntax above outlines the general declarative features of the language. In declaring the set of rules according to the constructs above, an exemplary embodiment of the present invention provides for the following additional syntax requirements. A rule set must include a rule name, a set of declared features with their respective range of values, and a set of paradigms. The rule set may optionally contain a set of orthographic rules and accompanying sets of variables and sets of values over which the variables may range. Additionally, the declaration of a rule set must be arranged in the following order, the rule set name, the feature values, sets (used for orthographic rules) variables, (used for orthographic rules) orthographic rules and paradigms. In declaring paradigms, a parent paradigm must be declared before its children.

The Lexicon

In addition to the user prepared declarative rule set 16 to describe a natural language, the user 14 must also specify a lexicon 22. In an exemplary embodiment of the present invention, the lexicon 22 is a database of word entries that are each associated with some paradigm in the user specified rule set 16. In an exemplary embodiment, the information retrieval program AI-STARS manufactured by Digital Equipment Corporation can be used to create the database entries. The word entries in the lexicon 22 contain identifiers following the syntax of the M language 12. The syntax of the M language 12 provides a connection to associate the appropriate information stored in the lexicon 22 entries with appropriate form rules. For example, in APPENDIX III the user prepared declarative rules 16 (FIG. 1) for the French language list some typical lexical entries:

    ______________________________________
    citation: sentir
    paradigm: verb.sub.-- re.sub.-- ir
    stems: inf = sentir, PRES.sub.-- SUB = sen
    features:
    dation: pouvoir
    paradigm: verb.sub.-- re.sub.-- ir
    stems: inf = pouvoir, BASE = peu, pres.sub.-- 1s = peux, pres.sub.-- 1s
         puis, pres.sub.-- 2s = peux, pres.sub.-- 3p = pouvent, PRES.sub.-- P
    =
         pouv,PAS = pu, PAS.sub.-- P = pu, PRES.sub.-- SUB = pui,
         sub.sub.-- pres.sub.-- 1p = puissions, pres.sub.-- 2p = puissiez,
    FUT =
         pourr
    features:
    ______________________________________


A citation form (indicated by the "citation" by word) is used to identify the lemma. This string (e.g., sentir or pouvoir) should correspond to the form of a word one would use to look up a word in a regular dictionary. A citation form need not be unique. A noun and a verb may well share the same citation form. The name of the paradigm which describes the morphological behavior of the lemma is also provided in each lexicon to link the lexicon citation to the appropriate form generating rules. The user can specify in some cases one or more stem forms, to be used as starting points to derive the various inflected forms. For example, a regular verb (such as sentir above) might only require the infinitive stem to be specified in the lexicon 22, whereas a highly irregular verb (such as pouvoir above) may require stems for a number of forms. The English verb "run," for example, might require both the present stem "run" and the past stem "ran" to be stored in the lexical entry. Each stem stored in the lexicon 22 must be associated with the name of a corresponding form in its paradigm. Thus, for each stem, the present invention provides that a form name (the LHS of a form rule) must be provided in the lexicon 22. In this way, the present invention can map lexical stems to their proper use in the paradigm.

In addition, the present invention provides that features and values which are sometimes required by the user declared paradigms can be specified in the lexicon 22. Lexical override, discussed above, is one such feature where information in the lexicon 22 (other than the paradigm name, form name, and stem) is necessary to constrain morphological processing.

Parameters are another type of feature, pertaining to the application of an orthographic rule, that can be specified in a lexical entry. For example, the lexicon entry for the German word studieren, described above, contains the parameter (NO-GE) corresponding to an orthographic rule parameter, thereby indicating that the orthographic rule should apply in forming inflections of studieren. When a parameter is declared in an orthographic rule, a lexical entry for a word must contain a corresponding parameter value in order for the spelling rule to apply. Using this feature, the user is able to easily control the application of the orthographic rule to only the appropriate words.

Optionally, a lexical item may contain other kinds of information. For example, a lexicon entry can contain features and values which are independent of the paradigm, For example, if the gender of a French noun is not altered or determined in a paradigm (via affixation), then the gender of the noun may be indicated via a feature/value pair stored directly in the lexicon, as for example: "chapeau"gender=masc. Additionally, a lexical entry may contain pointers to word senses. A lemma may have one or more word senses, which are stored in the lexicon 22 as independent objects, and other syntactic information, such as subcategorization classes.

Referring again to FIG. 1, the user 14 inputs both a set of prepared declarative rules 16 according to the syntax of the M language 12 and a set of lexicon entries 22 as described above. This input is stored in the secondary storage location 10 until the compilation phase processing begins.

Rule Set Compilation

a. Overview

In the compilation phase, the processor 6 (FIG. 1) uses the executable statements and data structures comprising the compiler module 18 to read the set of user prepared declarative rules 16 and create the compiled rule set 20, which can be used for all phases of run-time morphological processing. FIGS. 2A-2I depict exemplary process flows for the various steps of compiling the set of user prepared declarative rules 16, such as those for the German language in APPENDIX II or those for the French language in APPENDIX III. Moreover, FIGS. 2A-2I depict the flow of control that the processor 6 follows in using the compiler module 18 to create a computer-manipulatable data structure to perform morphological processing. Additionally, a high-level description of the compilation process is presented in the exemplary algorithm listed in APPENDIX IV.

When the compiler module 18 (FIG. 1) is invoked by a user command, the processor 6 retrieves the set of user prepared declarative rules 16 and processes them according to,the statements in the compiler module 18. FIG. 2A depicts an exemplary process flow for rule-set compilation and provides an overview of the entire compilation process, the flow of control dictated by the statements of the compiler module 18. In step 40, the processor 6 uses the statements of the compiler module 18 to create the compiled rule set 20, initially comprising a parse tree. A parse tree is a heirarchical representation of the data contained in the user prepared declarative rule set 16. The root of the tree represents the top of the hierarchy and the elements of the user prepared declarative rules 16 are represented as nodes of the tree in levels. In the first level down from the root, the primary elements of the user prepared declarative rules 16 are grouped into categories identified by the nodes: feature values, sets, set variables, orthographic rules and paradigms. Below each categorization node, is a node for each declared element (e.g., a node for each of the declared paradigms). In the level below each declared element, there is a node corresponding to each of the components of the element. For paradigms, there would be nodes, for example, identifying surface form rules and intermediate form rules. Attached at the lowest nodes are leaf nodes containing the most specific data. For example, under each form rule node is a leaf node containing the value of the form operator (i.e., + or -). The tree, during the compilation process, evolves into the data structure that is the compiled rule set 20 by the operations described below.

With the set of user prepared declarative rules 16 parsed in step 40, the processor 6, in step 42, begins a loop to further process each declared paradigm identified in the tree data structure. In step 44, the semantic type of each component in the right-hand-side (RHS) of every form rule is "disambiguated" and, where appropriate, "chained" by a procedure that is described in detail below in FIG. 2B. As mentioned above, a form rule comprises a left-hand-side identifier of the form (the form name) and a set of right-hand-side construction components, i.e., a stem and, optionally, an affix and, if the affix is present, an operator. The right-hand-side stem can be an identifier, i.e., the left-hand-side (LHS) name of another form rule or a form set, and the RHS affix can be an identifier of an affix variable (affix.sub.-- var). This is illustrated by the chain of references from (pres.sub.-- 1s) to (BASE) to (inf) to (LEX) in:

    ______________________________________
              intermediate.sub.-- forms {
               BASE: inf - "ir"
               . . .
              }
              surface.sub.-- forms {
               inf : LEX
               pres.sub.-- 1s: BASE " "e"
              }
    ______________________________________


Disambiguation is the process of taking the identifiers of other form rules or form sets or affix variables (affix.sub.-- vars) and creating actual references or links to the appropriate data structures. The disambiguation process creates a chain of references between form rules, representing the steps of its construction. The chain must terminate with a lexical form indicator "LEX" to signal that a character string is stored in the lexicon. Disambiguation replaces the stem form names with links to the actual form rule data structures that represent the different form rules. Also, in the case of affix variables, a feature comprising the affix variable name gets associated with the form rule. The syntax of the M language 12 treats the affix variable name as a feature and the set of associated affix strings is treated as the range of that feature's value. This feature must be associated with each form that references the affix variable. The syntax is necessary because the value of the affix is dependent on the lemma. Therefore, the value must be obtained during generation so that the appropriate affix is used during form construction; during recognition, the value is obtained by matching the surface string against each value of the affix variable. Then that value must match the value associated with the lemma in the lexicon, i.e., that value is a feature constraint that must unify with the features of the lemma found in the lexicon. The disambiguation process is described in detail below in the discussion pertaining to FIG. 2B.

Referring again to FIG. 2A, after the form rules have been "disambiguated" (in step 44) the paradigm processing loop (step 42) continues in steps 46 and 48 to process any paradigms containing the statement "based-on" for form rule inheritance. In addition to form rule chaining, the present invention presents a method and apparatus for paradigm inheritance, in which one paradigm (a sub-paradigm) inherits the form rules contained in another paradigm (a super-paradigm). In step 46, the processor 6 determines whether the currently examined paradigm is "based-on" another paradigm by searching for a (based-on) declaration in a leaf node of the tree branch (in the data structure that is evolving into the compiled rule set 20) of that particular paradigm. If, in step 46, a (based-on) declaration is found, then in step 48, a form inheritance procedure is invoked. The inheritance procedure, which is explained in detail below with reference to FIGS. 2C-2E, creates references to the appropriate data structures that will allow the current paradigm (sub-paradigm) to "share" the form rules of its super-paradigm. Upon completion of the form rule inheritance process, each "shared" form rule node in the tree data structure contains a list of all the paradigms that share it.

Referring again to FIG. 2A, the paradigm processing loop (begun in step 42) continues after the form rule inheritance procedure (steps 46 and 48) with the expansion or "conflation", in step 50, of any form set that is referenced by any form rule. In step 50, form set conflation comprises the process of expanding or filling out the compact descriptions of the user declared form rules 16 by replacing each form rule that contains a RHS reference to a form set with a set of form rules, one form rule for each member of the form set. For example, the user 14 might declare the following form set, intermediate form rules and surface form rules in a paradigm:

    ______________________________________
           form sets {
            DEGREE = (POS, COM, SUP)
           }
           intermediate forms {
            BASE:LEX
            POS: BASE
            COM: BASE + "er"
            SUP: BASE + "est"
           {
           surface forms
           nom.sub.-- msing: DEGREE + "er"
           gen.sub.-- msing: DEGREE + "es"
           dat.sub.-- msing: DEGREE + "em"
           }
    ______________________________________


The form set DEGREE is referenced in each of the three surface form rules: nom.sub.-- msing, gen.sub.-- msing and dat.sub.-- msing. The form set DEGREE includes references to three intermediate form rules POS, COM and SUP. The form set conflation process replaces the three surface form rules that reference the form set DEGREE with the following set of surface form rules:

    ______________________________________
    surface.sub.-- forms {
     nom.sub.-- msing.POS
                         POS + "er"
     nom.sub.-- msing.COM
                         COM + "er"
     nom.sub.-- msing.SUP
                         SUP + "er"
     gen.sub.-- msing.POS
                         POS + "es"
     gen.sub.-- msing.COM
                         COM + "es"
     gen.sub.-- msing.SUP
                         SUP + "es"
     dat.sub.-- msing.POS
                         POS + "em"
     dat.sub.-- msing.COM
                         COM + "em"
     dat.sub.-- msing.SUP
                         SUP + "em"
    ______________________________________


For each surface form previously listed, there are now three surface forms--each corresponding to one of the references in the form sets. Each surface form is uniquely identified on the left-hand-side (LHS) by using the prior form name concatenated with a form set member identifier.

Referring again to FIG. 2A, after the form set conflation procedure of step 50, the paradigm processing loop (begun in step 42) continues in step 52 with the final step of checking all form rules of a paradigm for cycles. The cycle detection step comprises a process of taking each form rule and following the chain of form rule references to make sure that no reference to a form rule is repeated within the chain. As each form rule must eventually chain back to a LEX indicator that signals a string in the lexicon, a repeated reference to a form construction rule in the chain indicates that the form rule will cycle rather than map back to a string form. For example, the following declarations would be an incorrect declaration for verbs in the French language:

    ______________________________________
              intermediate.sub.-- forms {
              inf: BASE-"ir"
              surface.sub.-- forms {
              BASE: pres.sub.-- 1p + "ir"
              pres.sub.-- 1p: inf + "e"
              }
    ______________________________________


This example provides the following nonsensical chain of (pres.sub.-- 1p) to (inf) to (BASE) to (pres.sub.-- 1p) and is an illegal declaration according to the exemplary syntax of the M language 12. The process is to begin with the surface form rules for each paradigm and chain back through the form references, checking for repeated references.

During the cycle checking process, the present invention also provides a method and apparatus for propagating any morpho-syntactic features associated with intermediate forms up through the chain to the surface forms. When a surface form is constructed by a succession of form rules, it would be possible to simply associate all the grammatical features of the result with the surface form itself. However, it is sometimes the case that it is advantageous to associate some subset of the final set of features with one or more intermediate forms. For example, in German, a user might construct the imperfect tense surface forms in two steps, first constructing an imperfect stem, then appending suffixes to it to indicate person and number:

    ______________________________________
    intermediate-forms {
     IMP:BASE       [tense = imperfect]
    surface-forms {
     imp.sub.-- 1s:IMP + "te"
                    [person = 1, number = singular]
     imp.sub.-- 2s:IMP + "test"
                    [person = 2, number = singular]
     imp.sub.-- 3s:IMP + "te"
                    [person = 3, number = singular]
    }
    ______________________________________


The task is to indicate that the surface forms imp.sub.-- 1s, imp.sub.-- 2s, and imp.sub.-- 3s also carry the imperfect tense as a feature value. The present invention provides a default mechanism to pass feature values from included stems to resulting forms. Any form constructed using the append operator (+) will, by default, also include the features of its stem form rule without the user having to explicitly include those features in the list of features associated with the resulting form. This process is called "feature propagation", because features from intermediate forms are propagated to any forms which utilize them in their construction. Feature propagation is a notational convenience, obviating the need to specify certain features many times in resulting forms when they can be specified once in an intermediate form and propagated automatically to the resulting forms. It also reflects well, from a linguistic standpoint, how the grammatical interpretation of a surface form is built up from its parts. Feature propagation does not occur across construction rules using the removal (-) operator. Typically, the removal operator is employed to create new base strings by removing affixes. Hence, it would be unusual to desire features to propagate across such operations.

Referring again to FIG. 2A, the processor 6 begins a process for cycle checking and feature propagation as follows. In step 52, the processor 6 searches for surface form rule nodes in the tree data structure (which is evolving into the compiled rule set 20) to determine if all the form rules for a given paradigm have been processed. If so, the processor 6 proceeds, in step 59, to a get next paradigm, and the flow of the process returns again to step 42, to begin processing another paradigm. If, in step 52, all the form rules for a given paradigm have not yet been checked, the processor 6, in step 57, will get another form rule from the tree data structure. Then, in step 58, the processor 6 will call a cycle check/feature propagation procedure to determine if the chain of form rule references is free of cycles and to propagate features (for form rules with a + operator) in a procedure described in detail below with reference to FIG. 2F. If the procedure invoked in step 58 encounters a cycle in the form rule in question, the processor 6 generates an error in step 60 and then returns to step 52 to process another form rule, until all form rules for a paradigm are processed.

After the cycle check and feature propagation procedures described in steps 52-60 in FIG. 2, processing of that paradigm is complete. The processor 6 then returns to step 42 to determine if there are other paradigms to process, and if so, the processor 6 will execute steps 44-60 described above.

When, in step 42, the processor 6 finds that all paradigm form rule processing is complete, the processor 6 will then proceed, in step 54, to "conflate" the orthographic rules with the form rules. A feature of the present invention is the fact that orthographic rules are described separately from the grammatical paradigms. The compiler plays a special role in permitting the user to freely declare orthographic rules apart from grammatical construction paradigms in that it automatically compares each orthographic rule with each form rule to determine to which form rules each orthographic rule might apply. To make this determination, the processor 6, in step 54, will compare the affix and operator in the left-hand-side of the orthographic rule with the operator and affix in the right-hand-side of the form rule. Once the processor 6 determines in step 54 that set of "matching" form rules, the processor 6 cycles through the set, conflating the orthographic rule with each form rule, in a process that will be described in detail below in FIG. 2G, creating an inner variant intermediate form rule and an outer variant form rule of the same type as that of the matching form rule to map the correct spelling of the word to the grammatical construction in a process as described below. If the orthographic rule contains a parameter, that parameter is associated with the newly created outer variant form rule.

After the processor 6 performs the steps of orthographic rule conflation in step 54, the process of compilation is completed, in step 56, by building a discrimination net that is incorporated as a branch of the tree structure of the compiled rule set 20. A discrimination net is a standard data structure used in artificial intelligence programming and is commonly used by systems for morphological recognition. The present invention creates a discrimination net to index the final suffixes of those surface form construction rules having a plus (+) operator in the right-hand-side, and additionally stores together references to those surface form rules having a minus (-) operator on the right-hand-side or a prefix in a "non-netted" form rule set that is also included in the data structure of the compiled rule set 20. Additionally, in step 56, the processor 6 provides references in the discrimination net to corresponding surface form rules, stem form rules and shared paradigms. The process of building the discrimination net is described more fully below with reference to FIGS. 2H-L.

Once the discrimination net is complete, the processor 6, in step 61, stores that discrimination net, along with the rest of the compiled rule set 20 (FIG. 1), in the secondary storage area 10 (FIG. 1). Compilation phase processing is now complete, and the processor 6, in step 62, will return and commence a run-time phase upon user command. However, the processes of disambiguation and chaining, form inheritance, cycle checking, orthographic rule conflation and discrimination network building are described in detail below.

a. Right Hand Side Disambiguation and Chaining of Form Rules

In FIG. 2A, step 44, the processor 6 invokes a procedure to "disambiguate" and create form rule "chains", using the right-hand-side (RHS) of each form rule in a paradigm. FIG. 2B depicts exemplary process flow for the right-hand-side form rule disambiguation and chaining process of step 44 in FIG. 2A. The process of disambiguation is to take each component of the RHS of a form rule, and if that component has as its value an identifier, find the corresponding structure (i.e., form rule, form set or affix variable (affix.sub.-- var)). In the case of a form or affix variable name, the processor 6 will replace the name with a reference to the actual form rule or affix variable data structure in the compiled rule set 20. As stated above, the syntax of the M language 12 (FIG. 1) treats the affix variable name as a feature and the set of associated affix strings are treated as the range of that feature's values. This is necessary because the value of the affix is dependent on the lemma.

Therefore, the value must be obtained during generation so that the appropriate affix is used during form construction; during recognition, the value is obtained by matching the surface string against each value of the affix variable. That value must unify with the features of the lemma.

In the case of a form set name, a reference to the form rule stem is stored for subsequent form set conflation. Referring to FIG. 2B, the processor 6 begins a loop to process the form rules of each paradigm on a paradigm by paradigm basis. In step 70, the processor 6 checks whether all form rules for a given paradigm in the processing loop of FIG. 2A have been processed. Referring to FIG. 2B, if in step 70, there is an unprocessed form rule, the processor 6, in step 72, will retrieve the form rule from the corresponding node in the tree data structure (which will become the compiled rule set 20). In step 74, the processor 6 begins a loop to process each RHS component of the form rule. If there is a component not yet processed, the processor 6, in step 76, will get the next component. In step 78, the processor 6 determines whether or not the RHS component is an identifier to be disambiguated. If the component is not an identifier (i.e., it is either a literal string, the special indicator NIL (meaning the form rule does not exist in the paradigm) or the special indicator LEX (meaning that the form is stored in the lexicon 22), the processor 6 will stop processing that current RHS component and return to step 74 to process another RHS component if there is one.

Referring again to step 78, if the current component is an identifier, the processor 6, proceeds to step 80 and makes a determination of whether the identifier is a form name. As described above, a form name is the left-hand-side identifier of a form rule (e.g., "pres.sub.-- 1s"). If so, the processor 6, in step 82, will replace the string of characters representing the stem in the form rule node of the tree data structure with a reference to the actual (stem) form rule. To find that form rule in step 82, the processor 6 will search the set of form rules of the current paradigm, and if it is not found, then proceeds to search hierarchically for "ancestor" paradigms of the current paradigm. An ancestor paradigm is the immediate super-paradigm of the current paradigm (as determined by the based-on statement) or an "ancestor" paradigm of that super-paradigm. Once the string identifier in the component is replaced in step 82 with an actual reference, the processor 6 will return to step 74 to process another RHS component.

Referring again to step 80, if the identifier is not a form name, the processor 6, in step 84, will next determine whether the identifier is a form set name. As described above, a form set can be used as the RHS stem to identify a group of form rules which represent a similar type of form construction. If, in step 84, the identifier is a form set name, the processor 6, in step 86, will place the reference to the component (i.e., form rule stem) in a special node of the tree data structure for subsequent processing. The form rule containing that form set identifier will be replaced later in step 50 (FIG. 2A) with a set of form rules, one for each member of the form set. Next, the processor 6, after completing step 86, proceeds to the beginning of the component processing loop, step 74, to process any remaining RHS components of the current form rule.

Referring again to step 84, if the identifier is not a form set name, the processor 6, in step 88, will determine whether or not the identifier is an affix variable name. As described above, an affix variable represents a set of possible affix strings that could be applied in a particular form rule construction. If in step 88 the identifier is an affix variable name, the processor 6 will replace, in step 90, the string in the stem of the form rule node with a reference to the actual location in the tree data structure of the compiled rule set 20 corresponding to the affix variable and associates with the form rule a feature comprising the affix variable name. After step 90, the processor 6 returns to the beginning of the component processing loop (step 74).

If in step 88 the processor 6 determines that the component is not an affix variable identifier, in step 92 the processor 6 will generate an error, because the component is not one of the legal forms as either a string, a NIL indicator or a LEX indicator, form rule identifier, form set identifier, or affix variable identifier.

Referring to step 74, when all form processing is complete, the processor 6 will return to the beginning of the form rule processing loop (step 70) to process the RHS components of the form rules as described above. When each form rule has been processed, the processor 6, returns to step 44 in FIG. 2A.

b. Form Rule Inheritance (With Cycle Checking)

Referring again to FIG. 2A, in step 48, the processor 6 performs the form rule inheritance procedure. FIGS. 2C-2F depict exemplary process flows for the form rule inheritance procedure. Form rule inheritance, as described, permits the form rules of the present invention to be re-used, or shared, by different paradigms. Where the processor 6 encounters the "based-on" declaration in a paradigm, the processor 6 will perform the process as follows.

In step 46, (FIG. 2A) the processor 6 will search the current paradigm branch of the tree data structure (which will become the compiled rule set 20 (FIG. 1)) to locate a "based.sub.-- on" declaration, i.e. "based.sub.-- on: <parent.sub.-- paradigm>". If one is found, the processor 6 will pass control to the process flow as depicted in FIG. 2C, to step 100, retrieve the form rules from the super-paradigm, and, in step 102, begin to process those form rules that have not been "superseded" by the sub-paradigm referenced in the "based.sub.-- on" declaration. A form rule (declared in the super-paradigm) is superseded when a form rule with an identical LHS is declared in a sub-paradigm. If the form rule has not been superseded, then it is checked for divergence. A divergence check takes the form rule to be inherited and goes back through the chain of stem form rule references to see whether any intermediate form rules in the chain have been superseded in the current paradigm. If so, then an additional branch (a reference to the divergent form rule) must be added to the reference chain in the inherited form rule. Both the form rule screening process and divergence check are described in detail below with reference to FIGS. 2D and 2E.

Referring again to step 102 in FIG. 2C, once the screening out of superseding form rules and the divergence check of inherited form rules is complete, the process, in step 102, will yield a set of form rules from the super-paradigm that will be "inherited" or "shared" by the sub-paradigm. In the tree data structure of the compiled rule set 20, each of the data structures for the inherited form rules of the super-paradigm will contain a "shared by" reference to the sub-paradigm. Those form rules may also contain "divergent path" references as appropriate, as will be described below. Further, a reference to each inherited form rule is stored in the subparadigm as an optimization for the run-time morphological generation function and to aid in divergence checking.

In step 104, the processor 6 will then set an "ancestor" variable to be equal to the super-paradigm of the current paradigm, and the processor 6 will investigate whether the super-paradigm "shares" any form rules with a paradigm super to it. In step 106, the processor 6 determines whether the ancestor paradigm shares any form rules by looking for the set of shared form rules in the tree data structure. If, in step 106, the ancestor is not based on another paradigm, the processor 6, in step 108, will terminate the procedure and return to the step where the form rule inheritance procedure was called (e.g., step 48 in FIG. 2A). Otherwise, in step 110, the processor 6 will retrieve the ancestor's set of shared form rules and, in step 102A, screen out superseded form rules and perform the divergence check on the inherited form rules mentioned above, and described in detail below.

In step 112, the processor 6 continues to check back through the paradigm heirarchy, by determining if the ancestor paradigm has its own ancestor. Again, this is performed by checking for the occurrence of a "based-on" statement in the ancestor paradigm and retrieving the superparadigm. If the ancestor is based on another, in step 116 the processor 6 sets the ancestor marker to equal the reference in the based-on declaration and returns to step 106 to perform the form rule screen and divergence check for any form rules shared by the ancestor. If, in step 112, the ancestor paradigm is not based on another, the processing will terminate and the processor 6 will return to step 48 in FIG. 2A.

c. Form Rule inheritance Processing--Form Rule Screening and Divergence Check

In step 102 of FIG. 2C, the processor 6 invokes a process to screen out those form rules that have been superseded and further check the rules that have not been superseded for path divergence. That process is now described with reference to FIGS. 2D-E. In FIG. 2D, the processor 6 begins with the set of form rules declared in the super-paradigm and it begins, in step 120, a loop to screen out those form rules that are: 1) superseded by the "current" (sub) paradigm, and 2) check the form rules to be inherited for divergent paths.

If, in step 120, there are unchecked form rules, the processor 6 proceeds in step 124 to get a form rule from the tree data structure of the compiled rule set 20. In step 126, the processor 6 checks the form rule against the form rules in the current paradigm branch of the tree to determine whether or not it has been superseded. Again, a form rule is superseded when a form rule with an identical left-hand-side is declared in a subparadigm. If the form rule has been superseded, the processor 6, in step 126, disregards the form rule (as it will not be shared) and proceeds, in step 120, to the beginning of the loop to process another form rule from the set of form rules of the superparadigm. And if all the form rules have been processed, the processor 6 will return to step 102 in FIG. 2C.

Referring again to step 126 in FIG. 2D, if the form rule currently being examined has not been superseded, the processor 6 will, in step 127, invoke a procedure to perform a divergence check on the form rule. FIG. 2E, described below, depicts an exemplary process flow for the divergence check process of step 127.

When a form rule is inherited, the super-paradigm form rule is "shared" by the sub-paradigms that inherit the rule. This is accomplished by: 1) associating with each shared form rule a list of the paradigms that share it; and 2) placing a reference to the shared form rule in each of the subparadigms that have inherited it. (The latter is an optimization for the function of morphological generation and aids in the divergence check). For example, associated with a super-paradigm form rule (pres.sub.-- 1s) that is inherited by a number of sub-paradigms, might be the following information:

    ______________________________________
    Shared by:            1,3,4
    Affix Type:           SUFFIX
    Operator:             +
    Affix Data Type:      String
    Affix:                "e"
    Stem:                 BASE2
    ______________________________________


where the "shared by" entry comprises a set of references to the paradigms that share this form rule. Referring to FIG. 2E, in step 130, the processor 6 adds the current paradigm to the list of sharing paradigms of the current form rule of the super-paradigm.

Next in step 132, of FIG. 2E, the processor 6 retrieves the stem of the ancestor form rule, and, in step 134, determines whether the stem value is a reference to another form rule. If not, no further divergence checking needs to be performed, and the processor 6 will, in step 136, return to where the divergence check was invoked (i.e., step 127 in FIG. 2D).

Referring again to step 134, if the stem is a reference to another form rule, the processor 6 proceeds to step 138 to determine if that stem form rule has been superseded anywhere between the original paradigm that is inheriting form rules, and the current ancestor. This determination is made by searching the original paradigm, and each one of its ancestors back to the current ancestor, for a form rule with an identical left-hand-side. If the form rule was not found in step 138, there is no divergence. The processor 6 then returns to step 130 to continue following the chain of form rules, processing the stem of that stem of the form rule. If a superseding form rule was found in step 138, the processor 6 proceeds to step 140 to determine if the original form rule being checked is a surface form rule. If so, then a cycle check procedure must be invoked.

As stated above (with reference to step 58 in FIG. 2A), a cycle check is a procedure to check the reference chain of the form rule, searching the form rules referenced in the right-hand-sides of the form rules, until a LEX indicator is found. In the present invention, a form rule cycle check could be performed at one of two different times.

Referring to FIG. 2A, in step 58, the processor 6 invokes the cycle check on the set of surface form rules declared in the "current paradigm". If the current paradigm is based on another paradigm, then that set of surface form rules represents a set of (surface) form rules that are superseding those with the same left-hand-side in the super-paradigm. Also, if the current paradigm is based on another paradigm, then prior to step 58 (FIG. 2A), that paradigm (in step 48, FIG. 2A) will have inherited form rules from its super-paradigm. In the course of inheriting form rules in step 48 (FIG. 2A), a check for divergence is performed as in step 58A (FIG. 2E). If divergence is in fact detected, as in step 138 (FIG. 2E) described above, then a cycle check must be performed on the path with the superseding intermediate form rule (if the original form rule is a surface form rule). If the processor 6 did not perform this cycle check, then there could be the situation in which, for example, the paradigm verb.sub.-- er shares (i.e., inherits) pres.sub.-- 1s, which is declared in the super-paradigm verb.sub.-- root (APPENDIX III) as follows:

pres.sub.-- 1s: BASE+"es"

where BASE is declared as:

BASE: inf-"ir".

Assuming that BASE was also declared in verb.sub.-- er (i.e. the latter paradigm), as follows:

BASE: pres.sub.-- 1s+"es"

that would constitute a cycle, or an illegal path. Therefore, all divergent paths must also be checked for cycles. When the processor 6 invokes the cycle checking procedure in step 58 (FIG. 2A), however, this same path will not be checked, because only paths (chains) originating at the current paradigm will be checked, whereas the paths checked in step 58A (FIG. 2E) originated in the super-paradigm, but then "diverged" to a form rule declared in the subparadigm.

Referring again to FIG. 2E, if, in step 140, the original form rule is not a surface form rule, the processor 6 proceeds to step 146, where it will add to the node corresponding to that stem in the tree data structure, a reference to the superseding form rule and subparadigm. Once the divergence is made in step 148, the processor 6 in step 148, returns to step 127 in FIG. 2D.

Referring again to step 140 in FIG. 2E, if the original form rule was a surface form rule, then a cycle check must be made in step 58A. As described above, the cycle check is a procedure to check the chain of form rule references in the form rule stems to determine whether each form rule chain of references resolves to a LEX indicator. The process for cycle checking is described in detail below with reference to FIG. 2G. Referring again to step 58A in FIG. 2E, if the cycle check procedure identifies a cycle, the processor 6, in step 144, will generate an error and proceed, in step 148, to return to step 127 in FIG. 2D.

Otherwise, if no cycle is found, the processor 6 will continue to step 146 and add to the stem's list of divergents, a reference to the superseding form rule and subparadigm. Like the form rule "sharing" information described above, diverging form rule information is also stored in the tree data structure as nodes associated with the relevant form rule stem. Where a stem form rule is superseded, the reference to the superseding form rule, and its paradigm, associated with the superseded stem of the super-paradigm form rule, provides a branch that will enable the processor 6, in the run-time phase, to follow the correct path for any given form rule in any paradigm. This maximizes the amount of form rule sharing, while allowing for branching at any point in the form rule reference chain.

Referring again to step 146, FIG. 2E, once the reference to the divergent form rule is placed in the stem, the processor 6 in step 148, returns to step 127 in FIG. 2D.

d. Cycle check and Feature Propagation

Cycle check processing, as described above, checks the chain of form rule references starting at each surface form rule to insure that each form rule chain eventually resolves to a LEX indicator, which signals that the form string is stored in the lexicon. In FIG. 2A, the cycle check process is invoked in step 52 during the processing of the form rules for each paradigm. The cycle check procedure is also invoked during the divergence checking process in step 52 at FIG. 2E to insure that where the path of an inherited form rule diverges, the path of divergence also does not contain a cycle.

FIG. 2F provides an exemplary process flow of the cycle check procedure of step 52. In addition to cycle checking, this procedure will also propagate features for all form rules having a + operator in the present invention. Processing begins in step 150 with the processor 6 taking a surface form rule as input and beginning to process each component of its right-hand-side (RHS). In step 150, if there is a RHS component to check, the processor 6, in step 154, will get the next component and proceed to step 156. In step 156, the processor 6 determines whether the component references a form rule. If so, the processor 6 next determines in step 158 whether or not the form rule reference has already been seen in this cycle check. To make this determination, the processor 6 will scan the entries in a form rule list 159: Initially, for each sail to the cycle propagation process, the form list will be empty. If, in step 158 the stem form rule reference has not been "seen", the processor 6, in step 162, will mark the rule as being seen by adding the reference to the form rule list 159. Proceeding to step 58B, the processor 6 will now follow the chain of form rule references in step 58B, recursively calling this cycle check procedure, passing the stem form rule as the form rule argument. If the recursive call in step 58B to the cycle check procedure discovers a cycle, the processor 6, in step 160, will end the cycle check and also return an error to step 52 in either FIG. 2A or 2D (depending on where the call was made).

Referring again to step 58B, in FIG. 2G, if the recursive cycle check returns no cycle error (i.e., the cycle check is false), the processor 6 proceeds to step 166 to determine whether the form rule operator occurring in the RHS of the surface form is a plus (+). If the operator is a plus, then the processor 6 will perform the function of feature propagation for the surface form being checked.

As stated above, the feature propagation process propagates features (corresponding to the morpho-syntactic properties of a word form) from the intermediate forms of a form and associates them with the surface form. This process gives the user of the M language syntax an added convenience because he or she need specify the grammatical features only once rather than over and over for each surface form rule that references a given intermediate rule. In step 168, the processor 6 propagates the form features associated with the intermediate form rule, adding them into the feature specification of the surface form rule.

Once the features of the component are propagated in step 168, the processor 6 returns to the beginning of the cycle check loop and step 150 to process another RHS component. If, in step 150, there is another component to process, the processor 6, in step 154, will retrieve that component and proceed to step 156. If the component is not a reference to a form rule, the processor 6 next proceeds to step 170, where it determines whether the component is a LEX indicator. When the processor 6 finds a LEX indicator, it has hit the end of the form rule reference chain and here the processor 6 takes the features associated with the form rule where the LEX indicator was encountered and passes them back to the caller of the cycle check procedure. Where the cycle check procedure was recursively invoked from step 58B (FIG. 2F), the form features will bubble back up the chain, each form rule adding its features into the feature specification that is being passed back up the chain, and which eventually gets passed to the original caller, which adds them to the feature specification of the surface form rule. If, in step 170, the processor 6 determines that the component is a LEX indicator, the processor 6 will proceed to propagate features associated with form rule, but only if the form rule operator is a plus. In step 166, the operator is checked, and in step 168, the processor 6 propagates these features by passing the feature specification back up to the caller.

From step 168, the processor 6 returns to step 150 to process another RHS component of the form rule if any remain. If none remain, the processor 6 returns, in step 152, to the calling step, setting a success flag to true, and cycle check processing is complete.

e. Orthographic Rule Conflation

Referring again to step 54 in FIG. 2A, an exemplary process flow for the process of orthographic rule "conflation" is depicted in FIG. 2G. The present invention provides a method by which a user can specify a description of a language allowing specification of orthographic rules separate from the specification of the grammatical paradigm. And in the compilation process, the present invention automatically applies the orthographic rules across the paradigms to the appropriate form rules, without the user having to specify the application.

Referring to FIG. 2G, the processor 6 in step 200, begins a loop to process each orthographic rule declared in the user prepared declarative rules 16 (FIG. 1). If there are orthographic rules to process, the processor 6, in step 204, attempts to locate a set of form rules which are described by or match the characteristics of the orthographic rule. To find a match, the processor 6 will compare the left-hand-side of the orthographic rule to the right-hand-side of the form rule attempting to match the affix type (i.e., prefix or suffix) operator (i.e., plus [+] or minus [-]) and affix string. For example, in step 204, assume that an orthographic rule has been declared:

    ______________________________________
    orthographic rules {
    -> [dental "et"] ["t"]
    ______________________________________


In step 204 the processor 6, may arrive at the following form rule:

imp.sub.-- 2s:IMP+"test"

where:

IMP:BASE

and

BASE:inf-"en"

It is possible that the orthographic rule might apply to the form rule imp.sub.-- 2s, because the affix in the form rule is a suffix, the operator is a plus and the first letter in the form rule affix is a "t", thus matching the "t" affix string in the orthographic rule. The processor 6 will attempt to match all form rules (both surface and intermediate) to the orthographic rule.

After the attempt to match all the form rules with the orthographic rule in step 204, the processor 6 next begins a loop in step 206 to "conflate" each of the matching form rules. If the processor 6 determines there are additional rules to conflate, it proceeds to step 208 and retrieves the form rule. To conflate the form rule, the processor 6 will first create an intermediate form rule called an inner variant form rule or "inner variant", that has the same stem as the matching form rule, but which uses as its affix the context of the orthographic rule and the minus operator. In the example above, the inner Variant form rule Created would be:

imp.sub.-- 2s *I*1:IMP-[dent]

where [dent] is the context of the orthographic rule.

In step 212, the processor 6 next proceeds to create an additional form rule variant called an "outer variant". In the exemplary embodiment of the present invention, only the outer variant is associated wi