Method and apparatus for morphological analysis and generation of natural language text

5794177

Abstract

This invention improves information retrieval and the precision of language processing by providing an apparatus and method for organizing, utilizing, analyzing, and generating morphological data. The apparatus and method involve locating a stored lexical expression representative of a candidate word found in a stream of natural language text, identifying a paradigm for the candidate word based upon the stored lexical expression, and applying transforms contained within the identified paradigm to the candidate word.

Claims

Having described the invention, what is claimed as new and secured by Letters Patent is:

Description

BACKGROUND OF THE INVENTION
__________________________________________________________________________
baseform part-of-speech tag
first character string to strip from the candidate word
→
second character string to add to the candidate word
part-of-speech tag of morphological transform
[optional field for prefixation]
__________________________________________________________________________
Each morphological transform can thus be described as containing a number of functional elements listed in sequence, as shown in FIG. 4C. In particular, the first functional element specifies the part-of-speech tag of the baseform of the candidate word, and the second functional element identifies the suffix to strip from the candidate word to form an intermediate baseform. The third functional element identifies the suffix to add to the intermediate baseform to generate the actual baseform, and the fourth functional element specifies the part-of-speech of the morphological transform. The fifth functional element is an optional element indicating whether prefixation occurs.

FIG. 4C illustrates, in particular, a morphological file suited to inflection and uninflection. For example, inflection transform 001 (as identified by column 73) contains three transformations shown in columns 75, 77 and 79, respectively. The column 75 transformation for inflection transform 001 contains the transform VB_ → d_ VBN. This transform contains rules specifying that: (1) the baseform part-of-speech is VB; (2) no suffix is to be stripped from the candidate word to form the intermediate baseform; (3) the suffix d is to be added to the intermediate baseform to generate the actual baseform; (4) the part-of-speech of the resulting inflected form is VBN; and (5) no prefixation occurs. The column 79 transformation for transform 001 contains the transform VB_ e → ing_ VBG. This transform specifies: (1) the baseform part-of-speech is VB; (2) the suffix e is to be stripped from the candidate word to form the intermediate baseform; (3) the suffix ing is to be added to the intermediate baseform to generate the actual baseform; (4) the part-of-speech of the resulting inflected form is VBG; and (5) no prefixation occurs.

A file similar to that shown in FIG. 4C can be constructed for derivation expansion and underivation (derivation reduction).
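The two transforms of paradigm 001 described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the `Transform` class, its field names, and the sample word "bake" are assumptions introduced here, and the column 77 transform is omitted because the text does not spell it out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transform:
    """One morphological transform; field names are hypothetical."""
    base_pos: str      # part-of-speech tag of the baseform
    strip: str         # suffix to strip (may be empty)
    add: str           # suffix to add
    result_pos: str    # part-of-speech tag of the transform
    prefix: str = ""   # optional prefixation field, unused in this sketch

    def inflect(self, baseform: str) -> Optional[str]:
        """Generate the inflected form, or None if the strip suffix is absent."""
        if not baseform.endswith(self.strip):
            return None
        stem = baseform[: len(baseform) - len(self.strip)]
        return stem + self.add

    def uninflect(self, word: str) -> Optional[str]:
        """Reverse the transform: remove the added suffix, restore the stripped one."""
        if not word.endswith(self.add):
            return None
        stem = word[: len(word) - len(self.add)]
        return stem + self.strip


# Columns 75 and 79 of paradigm 001, as given in the text.
PARADIGM_001 = [
    Transform("VB", "", "d", "VBN"),
    Transform("VB", "e", "ing", "VBG"),
]

for t in PARADIGM_001:
    print(t.result_pos, t.inflect("bake"))
```

Because the strip string differs between the two transforms, the intermediate baseform is "bake" in one case and "bak" in the other, which is the variable-length behavior the description attributes to paradigm 001.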
A derivational file, however, will not contain a functional element in the transform identifying part-of-speech information used in specifying whether a candidate word is a derivation or a derivational baseform. Information regarding derivation baseforms is instead stored in the word data table 31 of FIG. 3 under the Is Derivation Field 38.

Morphological file 71 of FIG. 4C also illustrates the use of portmanteau paradigms. Portmanteau paradigms provide a structure capable of mapping the morphological changes associated with words having complicated morphological patterns. In particular, morphological transforms 133, 134, 135, 136 and 137 (as identified in column 73) contain portmanteau paradigms used for associating a plurality of paradigms with any particular candidate word. Morphological transform 133 indicates that patterns "006" and "002", as identified in column 73, are used to inflect the candidate word associated with morphological transform 133. Accordingly, a candidate word associated with inflection transform 133 becomes further associated with inflection transforms 002 and 006. For instance, the portmanteau paradigm 133 identifies the two inflections of travel, which can be inflected as travelled or traveled, depending upon dialect. Portmanteau paradigm 133 can also be used to inflect install, which can also be spelled instal.

The illustrated portmanteau paradigms show one possible structure used for applying multiple paradigms to any particular candidate word. Another possible structure for providing portmanteau paradigms can be formed using word data table 31 and a representative entry 33, as shown in FIG. 3. For example, expression N2 in data table 31 points to a representative entry 33 having a noun inflection pattern 46, a verb inflection pattern 48, and an adjective/adverb inflection pattern 50. In addition, the patterns 46, 48, and 50 each point to a paradigm in a morphological file 71, as illustrated in FIG. 4C.
Thus, a candidate word matched with the expression N2 can become associated with a plurality of paradigms.

FIG. 4C illustrates a further aspect of the invention wherein the applicants' system departs dramatically from the prior art. In particular, a morphological baseform in accordance with the invention can vary in length and does not need to remain invariant. By utilizing baseforms of variable length, the invention removes many of the disadvantages associated with earlier natural language processing techniques, including the need for a large exception dictionary. The morphological file 71 includes transforms having a variable length baseform, such as paradigm numbers 001 and 004. For example, the column 75 and 77 transforms of paradigm 001 produce a baseform having no characters removed from the candidate word, while the column 79 transform of paradigm 001 produces a baseform having an e character removed. The column 75 transform of paradigm 004 produces a baseform having no characters removed, while the column 77 and 79 transforms of paradigm 004 produce baseforms having a y character removed from the candidate word. Thus, when processor 30 acts in accordance with the instructions of paradigms 001 or 004 to form all possible baseforms of a candidate word, the processor will form baseforms that vary in length.

FIG. 5 illustrates a database system stored in various portions of memory elements 14 and 22 showing a connection between tables 31, 62, and 70 for associating part-of-speech tags with various lexical expressions contained within a stream of text. An Expression N2 contained within the stream of text can be identified in the word data table 31 as representative entry 33. Representative entry 33 encodes the information contained in a 32-byte prefix, of which bytes 16-18 contain a code found in the part-of-speech combination table 62.
This table in turn relates this particular part-of-speech combination with index 343 in table 62, thereby associating the part-of-speech tags of ABN (pre-qualifier), NN (noun), QL (qualifying adverb), and RB (adverb) with Expression N2.

In accordance with a further aspect of the invention, a part-of-speech tag can be associated with an expression in the stream of text through the use of suffix table 70. For example, a first expression in the stream of text might contain a suffix ole, and can be identified in suffix table 70 as representative entry 63. A second expression in the stream of text might contain the suffix ole, and can be identified in suffix table 70 as representative entry 65. The pointer in representative entry 63 points to index 1 in table 62, and the pointer in representative entry 65 points to index 1 in table 62. Thus, both the first and second expressions in the stream of text become associated with the part-of-speech tag of NN.

FIG. 6 shows a block diagram of a noun-phrase analyzer 13 for identifying noun phrases contained within a stream of natural language text. The analyzer 13 comprises a tokenizer 43, a memory element 45, and a processor 47 having: a part-of-speech identifier 49, a grammatical feature identifier 51, a noun-phrase identifier 53, an agreement checker 57, a disambiguator 59, and a noun-phrase truncator 61. Internal connection lines are shown both between the tokenizer 43 and the processor 47, and between the memory element 45 and the processor 47. FIG. 6 further illustrates an input line 41 to the tokenizer 43 from the application program interface 11 and an output line from the processor 47 to the application program interface 11.

Tokenizer 43 extracts tokens (i.e., white-space delimited strings with leading and trailing punctuation removed) from a stream of natural language text. The stream of natural language text is obtained from text source 16 through the application program interface 11.
Systems capable of removing and identifying white-space delimited strings are known in the art and can be used herein as part of the noun-phrase analyzer 13. The extracted tokens are further processed by processor 47 to determine whether the extracted tokens are members of a noun phrase. As illustrated in FIGS. 7A-7I, tokenizer 43 can comprise a system for extracting lexical matter from the stream of text and a system for filtering the stream of text. Tokenizer 43 receives input from input line 41 in the form of a text stream consisting of alternating lexical and non-lexical matter; accordingly, lexical tokens are separated by non-lexical matter. Lexical matter can be broadly defined as information that can be found in a lexicon or dictionary, and is relevant for Information Retrieval Processes. Tokenizer 43 identifies the lexical matter as a token, and assigns the attributes of the token into a bit map. The attributes of the non-lexical matter following the lexical token are mapped into another bit map and associated with the token. Tokenizer 43 can further tag or identify those tokens that are candidates for further linguistic processing. This filtering effect by the tokenizer 43 reduces the amount of data processed and increases the overall system throughput. This implementation of tokenizer 28 has several benefits. It achieves high throughput; it generates information about each token during a first pass across the input stream of text; it eliminates and reduces multiple scans per token; it does not require the accessing of a database; it is sensitive to changes in language; and it generates sufficient information to perform sophisticated linguistic processing on the stream of text. Moreover, tokenizer 28 allows the non-lexical matter following each token to be processed in one call. Additionally, tokenizer 28 achieves these goals while simultaneously storing the properties of the non-lexical string in less space than is required to store the actual string. 
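The single-pass behavior described above, in which each lexical token is paired with a bit map of its own attributes and a second bit map describing the non-lexical matter that follows it, can be sketched as follows. This is a minimal sketch under stated assumptions: the attribute bits, the `Token` class, and the regular expression that decides what counts as lexical matter are all hypothetical and are not taken from FIGS. 7A-7I.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical attribute bits for the lexical token itself.
TOK_HAS_UPPER = 1
TOK_HAS_DIGIT = 2
TOK_HAS_INTERNAL_PUNCT = 4

# Hypothetical attribute bits for the non-lexical matter after the token.
GAP_SPACE = 1
GAP_PERIOD = 2
GAP_COMMA = 4
GAP_OTHER = 8

# Lexical run (letters/digits with internal ' or -) followed by a non-lexical run.
_PATTERN = re.compile(r"([A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*)([^A-Za-z0-9]*)")


@dataclass
class Token:
    text: str
    token_attrs: int
    gap_attrs: int   # properties of the trailing non-lexical string, not the string


def _token_bits(tok: str) -> int:
    bits = 0
    if any(c.isupper() for c in tok):
        bits |= TOK_HAS_UPPER
    if any(c.isdigit() for c in tok):
        bits |= TOK_HAS_DIGIT
    if "'" in tok or "-" in tok:
        bits |= TOK_HAS_INTERNAL_PUNCT
    return bits


def _gap_bits(gap: str) -> int:
    bits = 0
    for c in gap:
        if c.isspace():
            bits |= GAP_SPACE
        elif c == ".":
            bits |= GAP_PERIOD
        elif c == ",":
            bits |= GAP_COMMA
        else:
            bits |= GAP_OTHER
    return bits


def tokenize(text: str) -> List[Token]:
    """One pass over the text; no database access is needed."""
    return [Token(m.group(1), _token_bits(m.group(1)), _gap_bits(m.group(2)))
            for m in _PATTERN.finditer(text)]


for tok in tokenize("Hello, world. It's 42!"):
    print(tok)
```

Storing `gap_attrs` as a small integer rather than the gap string itself mirrors the stated goal of keeping the properties of the non-lexical string in less space than the string would need.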
Memory element 45, as illustrated in FIG. 6, can be a separate addressable memory element dedicated to the noun-phrase analyzer 13, or it can be a portion of either internal memory element 22 or external memory element 14. Memory element 45 provides a space for storing digital signals being processed or generated by the tokenizer 43 and the processor 47. For example, memory element 14 can store tokens generated by tokenizer 43, and can store various attributes identified with a particular token by processor 47. In another aspect of the invention, memory element 14 provides a place for storing a sequence of tokens along with their associated characteristics, called a window of tokens. The window of tokens is utilized by the processor to identify characteristics of a particular candidate token by evaluating the tokens surrounding the candidate token in the window of extracted tokens.

Processor 47, as illustrated in FIG. 6, operates on the extracted tokens with various modules to form noun phrases. These modules can be hard-wired digital circuitry performing functions or they can be software instructions implemented by a data processing unit performing the same functions. Particular modules used by processor 47 to implement noun-phrase analysis include modules that: identify the part-of-speech of the extracted tokens, identify the grammatical features of the extracted tokens, disambiguate the extracted tokens, identify agreement between extracted tokens, and identify the boundaries of noun phrases.

FIG. 8 depicts a processing sequence of noun-phrase analyzer 13 for forming noun phrases that begins at step 242. At step 243, the user-specified options are input to the noun-phrase analysis system. In particular, those options identified by the user through an input device, such as keyboard 18, are input to text processor 10 and channeled through the program interface 11 to the noun-phrase analyzer 13.
The user-selected options control certain processing steps within the noun-phrase analyzer as detailed below. At step 244, the user also specifies the text to be processed. The specified text is generally input from text source 16, although the text can additionally be internally generated within the digital computer 12. The specified text is channeled through the application program interface 11 to the noun-phrase analyzer 13 within the Buffer Block. Logical flow proceeds from box 244 to box 245.

At action box 245, tokenizer 43 extracts a token from the stream of text specified by the user. In one embodiment, the tokenizer extracts a first token representative of the first lexical expression contained in the stream of natural language text and continues to extract tokens representative of each succeeding lexical expression contained in the identified stream of text. In this embodiment, the tokenizer continues extracting tokens until either a buffer, such as memory element 45, is full of the extracted tokens or until the tokenizer reaches the end of the text stream input by the user. Thus, in one aspect the tokenizer extracts tokens from the stream of text one token at a time, while in a second aspect the tokenizer tokenizes an entire stream of text without interruption.

Decision box 246 branches logical control depending upon whether or not three sequential tokens have been extracted from the stream of text by tokenizer 43. At least three sequential tokens have to be extracted to identify noun phrases contained within the stream of text. The noun-phrase analyzer 13 is a contextual analysis system that identifies noun phrases based on a window of tokens containing a candidate token, at least one token preceding the candidate token, and one token following the candidate token in the stream of text.
If at least three tokens have not yet been extracted, control branches back to action box 245 for further token extraction, while if three tokens have been extracted logical flow proceeds to decision box 247.

At decision box 247 the system identifies whether the user requested disambiguation of the part-of-speech of the tokens. If the user has not requested part-of-speech disambiguation, control proceeds to action box 249. If the user has requested part-of-speech disambiguation, the logical control flow proceeds to decision box 248 wherein the system determines whether or not disambiguation can be performed. The noun-phrase analyzer 13 disambiguates tokens within the stream of natural language text by performing further contextual analysis. In particular, the disambiguator analyzes a window of at most four sequential tokens to disambiguate the part-of-speech of a candidate token. In one aspect the window of tokens contains the two tokens preceding an ambiguous candidate token, the ambiguous candidate token itself, and a token following the ambiguous candidate token in the stream of text. Thus, in accordance with this aspect, if four sequential tokens have not been extracted, logical flow branches back to action box 245 to extract further tokens from the stream of text, and if four sequential tokens have been extracted from the stream of text, logical flow proceeds to action box 249.

At action box 249, the part-of-speech identification module 49 of processor 47 determines the part-of-speech tags for tokens extracted from the stream of text. The part-of-speech tag for each token can be determined by various approaches, including: table-driven, suffix-matching, and default tagging methods. Once a part-of-speech tag is determined for each token, the part-of-speech tag becomes associated with each respective token.
After step 249, each token 21 in token list 17 preferably contains the most probable part-of-speech tag and contains a pointer to an address in a memory element containing a list of other potential part-of-speech tags.

In accordance with the table-driven aspect of the invention, the part-of-speech tag of a token can be determined using the tables shown in FIGS. 3-5. For example, a representative lexical expression equivalent to the extracted token can be located in the word data table 31 of FIG. 3. As shown in FIGS. 3-5, module 49 can then follow the pointer, contained in bytes 16-18 of the representative expression in word table 31, to an index 64 in the part-of-speech combination table 62. The index 64 allows module 49 to access a field 66 containing one or more part-of-speech tags. Module 49 at processor 47 can then retrieve these part-of-speech tags or store the index to the part-of-speech tags with the extracted token.

This table-driven approach for identifying the part-of-speech tags of extracted words advantageously provides a fast and efficient way of identifying and associating parts-of-speech with each extracted word. The word data table and the POS Combination Table further provide flexibility by giving the system the ability to change its part-of-speech tags in association with the various language databases. For example, new tables can be easily downloaded into external memory 14 or memory 22 of the noun-phrase system without changing any other sections of the multilingual text processor 10.

In accordance with the suffix-matching aspect of the invention, the part-of-speech tag of a token can be determined using the tables shown in FIGS. 4-5. For example, module 49 at processor 47 can identify a representative suffix consisting of the last characters of the extracted token in suffix table 70 of FIG. 4B.
Once a matching suffix is identified in suffix table 70, module 49 can follow the pointer in column 74 to an index 64 in part-of-speech combination table 62. The index 64 allows module 49 to access a field 66 containing one or more part-of-speech tags. The part-of-speech identification module 49 can then retrieve these part-of-speech tags or store the index to the part-of-speech tags with the extracted token. Generally, the suffix-matching method is applied if no representative entry in the word data table 31 was found for the extracted token.

A second alternative method for identifying the part-of-speech tags for the token involves default tagging. Generally, default tagging is only applied when the token was not identified in the word data table 31 and was not identified in suffix table 70. Default tagging associates the part-of-speech tag of NN (noun) with the token. As a result, at the end of step 249 each token has a part-of-speech tag or part-of-speech index that in turn refers to either single or multiple part-of-speech tags. After step 249, logical control flows to action box 250.

At action box 250, the grammatical feature identification module 51 of the processor 47 determines the grammatical features for the tokens 21 contained in the token list 17. The grammatical features for each token can be obtained by identifying a representative entry for the token in the word data table 31 of FIG. 3. The identified representative entry contains information pertaining to the grammatical features of the word in fields 32, 34, 36, 38, 40, 42, 46, 48, 50, 52, 54, 56, 58 and 60. These fields in the representative entry either contain digital data concerning the grammatical features of the token, or point to an address in a memory element containing the grammatical features of the token. After box 250, control proceeds to decision box 251.
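The three tagging methods of step 249 form a fallback cascade: word data table first, then suffix table, then the default tag NN. A minimal sketch of that cascade follows. The table contents are invented for illustration, except that the suffix ole pointing to index 1 (NN) and the NN default come from the description; trying the longest suffix first is also an assumption, since the text does not state the matching order.

```python
from typing import Dict, List

# Hypothetical stand-ins for word data table 31, suffix table 70,
# and part-of-speech combination table 62.
WORD_TABLE: Dict[str, int] = {"run": 2}
SUFFIX_TABLE: Dict[str, int] = {"ole": 1, "ly": 3}
POS_COMBINATIONS: Dict[int, List[str]] = {
    1: ["NN"],
    2: ["VB", "NN"],   # first tag listed is the primary tag
    3: ["RB"],
}
DEFAULT_TAGS = ["NN"]


def lookup_tags(token: str) -> List[str]:
    """Return the part-of-speech tags for a token, primary tag first."""
    key = token.lower()

    # 1. Table-driven: exact match in the word table.
    if key in WORD_TABLE:
        return list(POS_COMBINATIONS[WORD_TABLE[key]])

    # 2. Suffix matching: longest proper suffix first (an assumption).
    for n in range(len(key) - 1, 0, -1):
        suffix = key[-n:]
        if suffix in SUFFIX_TABLE:
            return list(POS_COMBINATIONS[SUFFIX_TABLE[suffix]])

    # 3. Default tagging.
    return list(DEFAULT_TAGS)


for word in ["run", "casserole", "quickly", "xyzzy"]:
    print(word, lookup_tags(word))
```

Keeping the tags behind an integer index, as the combination table does, means a new language can be supported by swapping the three tables without touching `lookup_tags`.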
Decision box 251 queries whether the user requested disambiguation of the part-of-speech tags. If disambiguation was requested, control proceeds to action box 252. If disambiguation was not requested, control proceeds to action box 253.

At action box 252, the part-of-speech tags of ambiguous tokens are disambiguated. The disambiguator module 59 of the processor 47 identifies tokens having multiple part-of-speech tags as ambiguous and disambiguates the identified ambiguous tokens. For example, a first token extracted from the stream of text can be identified in the word data table 31 and thereby have associated with the first token an index 64 to the part-of-speech combination table 62. Furthermore, this index 64 can identify an entry having multiple part-of-speech tags in column 66 of table 62. Thus, the first token can be associated with multiple part-of-speech tags and be identified as ambiguous by processor 47.

Preferably, the first listed part-of-speech tag in table 62, called a primary part-of-speech tag, is the part-of-speech tag having the highest probability of occurrence based on frequency of use across different written genres and topics. The other part-of-speech tags that follow the primary part-of-speech tag in column 66 of table 62 are called the secondary part-of-speech tags. The secondary part-of-speech tags are so named because they have a lower probability of occurrence than the primary part-of-speech tag. The disambiguator can choose to rely on the primary part-of-speech tag as the part-of-speech tag to be associated with the ambiguous token. However, this probabilistic method is not always reliable enough to ensure accurate identification of the part-of-speech for each token.
Accordingly, in a preferred aspect, the invention provides for a disambiguator module 59 that can disambiguate those tokens having multiple part-of-speech tags through contextual analysis of the ambiguous token. In particular, disambiguator 59 identifies a window of sequential tokens containing the ambiguous token and then determines the correct part-of-speech tag as a function of the window of sequential tokens.

In a first embodiment, the window of sequential tokens can include, but is not limited to, the two tokens immediately preceding the ambiguous token and the token immediately following the ambiguous token. In a second embodiment, the window of sequential tokens includes the ambiguous token, but excludes those classes of tokens not considered particularly relevant in disambiguating the ambiguous token. One class of tokens considered less relevant in disambiguating ambiguous tokens includes those tokens having part-of-speech tags of either: adverb; qualifying adverb; or negative adverbs, such as never and not. This class of tokens is collectively referred to as tokens having "ignore tags". Under the second embodiment, for example, the disambiguator module 59 forms a window of sequential tokens containing will run after skipping those words having ignore tags in the following phrases: will run; will frequently run; will very frequently run; will not run; and will never run. The second embodiment thus ensures, by skipping or ignoring a class of irrelevant tokens, an accurate and rapid contextual analysis of the ambiguous token without having to expand the number of tokens in the window of sequential tokens.
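The second embodiment's window formation can be sketched as a walk outward from the ambiguous token that skips any token carrying an ignore tag. This is a sketch only: the function name, the (word, tag) pair representation, and the tag "NEG" for negative adverbs are assumptions, and MD and VB are used here merely as placeholder tags for will and run.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]   # (word, primary part-of-speech tag)

# RB (adverb) and QL (qualifying adverb) appear in the text; "NEG" is a
# hypothetical tag standing in for negative adverbs such as never and not.
IGNORE_TAGS = {"RB", "QL", "NEG"}


def context_window(tokens: List[TaggedToken], i: int,
                   before: int = 2, after: int = 1) -> List[TaggedToken]:
    """Collect up to `before` preceding and `after` following tokens
    around tokens[i], skipping tokens whose tag is an ignore tag."""
    left: List[TaggedToken] = []
    j = i - 1
    while j >= 0 and len(left) < before:
        if tokens[j][1] not in IGNORE_TAGS:
            left.append(tokens[j])
        j -= 1

    right: List[TaggedToken] = []
    j = i + 1
    while j < len(tokens) and len(right) < after:
        if tokens[j][1] not in IGNORE_TAGS:
            right.append(tokens[j])
        j += 1

    return left[::-1] + [tokens[i]] + right


phrases = [
    [("will", "MD"), ("run", "VB")],
    [("will", "MD"), ("frequently", "RB"), ("run", "VB")],
    [("will", "MD"), ("very", "QL"), ("frequently", "RB"), ("run", "VB")],
    [("will", "MD"), ("not", "NEG"), ("run", "VB")],
    [("will", "MD"), ("never", "NEG"), ("run", "VB")],
]
for phrase in phrases:
    window = context_window(phrase, len(phrase) - 1)
    print([word for word, _ in window])
```

All five phrases reduce to the same two-word window, so a single rule written against will followed by run covers every variant.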
Moreover, a window of four sequential tokens ranging from the two tokens immediately preceding the ambiguous token to the token immediately following the ambiguous token can be expanded to include additional tokens by: (1) skipping those tokens contained within the original window of four sequential tokens that have ignore tags, and (2) replacing the skipped tokens with additional sequential tokens surrounding the ambiguous token.

The functions or rules applied by module 59 identify the most accurate part-of-speech of the ambiguous token based both upon the window of sequential tokens containing the ambiguous token and the characteristics associated with those tokens contained within the window of tokens. The characteristics associated with the tokens include, either separately or in combination, the part-of-speech tags of the tokens and the grammatical features of the tokens.

Once the disambiguator module 59 of the processor 47 has identified the most accurate part-of-speech tag, the processor places this part-of-speech tag in the position of the primary part-of-speech tag, i.e., first in the list of the plurality of part-of-speech tags associated with the ambiguous token. Thus, the ambiguous target token remains associated with a plurality of part-of-speech tags after the operations of processor 47, but the first part-of-speech tag in the list of multiple part-of-speech tags has been verified as the most contextually accurate part-of-speech tag for the ambiguous token.

In one aspect, disambiguator 59 can determine that no disambiguation rules apply to the ambiguous token and can thus choose not to change the ordering of the plurality of part-of-speech tags associated with the ambiguous token. For example, a token having multiple part-of-speech tags has at least one part-of-speech tag identified as the primary part-of-speech tag.
The primary part-of-speech tag can be identified because it is the first part-of-speech tag in the list of possible part-of-speech tags, as illustrated in FIG. 4A. If the disambiguator 59 determines that no disambiguation rules apply, the primary part-of-speech tag remains the first part-of-speech tag in the list. In a further aspect, a disambiguation rule can be triggered and one of the secondary part-of-speech tags can be promoted to the primary part-of-speech tag. In accordance with another aspect, a disambiguation rule is triggered and the primary part-of-speech tag of the ambiguous token is coerced into a new part-of-speech tag, not necessarily found amongst the secondary part-of-speech tags. An additional aspect of the invention provides for a method wherein a disambiguation rule is triggered but other conditions required to satisfy the rule fail, and the primary part-of-speech tag is not modified. Thus, after disambiguating, each token has a highly reliable part-of-speech tag identified as the primary part-of-speech tag.

FIG. 9 illustrates an exemplary rule table used for disambiguating an extracted token in the English language. As discussed with respect to the tables illustrated in FIGS. 3-5, the disambiguation tables can differ from language to language. Advantageously, the tables can be added to the system 10 or removed from the system 10 to accommodate various languages without modifying the source code or hardware utilized in constructing the multilingual text processor 10 in accordance with the invention.

The illustrated table contains: (1) a column of rules numbered 1-6 and identified with label 261; (2) a column representing the ambiguous token [i] and identified with label 264; (3) a column representing the token [i+1] immediately following the ambiguous token and identified with label 266; (4) a column representing the token [i-1]
immediately preceding the ambiguous token and identified with the label 262; and (5) a column representing the token [i-2] immediately preceding the token [i-1] and identified with the label 260. Accordingly, the table illustrated in FIG. 9 represents a group of six disambiguation rules that are applied by disambiguator 59, as part of the operations of the processor 47, to a window of sequential tokens containing the ambiguous token [i]. In particular, each rule contains a set of requirements in columns 260, 262, 264, and 266, which if satisfied, cause the primary part-of-speech of the ambiguous token to be altered. In operation, processor 47 sequentially applies each rule to an ambiguous token in the stream of text and alters the primary part-of-speech tag in accordance with any applicable rule contained within the table.

For example, rule 1 has a requirement and result labeled as item 268 in FIG. 9. In accordance with rule 1, the processor 47 coerces the primary part-of-speech tag of the ambiguous token to NN (singular common noun) if the ambiguous token [i] is at the beginning of a sentence and has a Capcode greater than 000 and does not have a part-of-speech tag of noun.

Rules 2-6, in FIG. 9, illustrate the promotion of a secondary part-of-speech tag to the primary part-of-speech tag as a function of a window of tokens surrounding the ambiguous token [i]. In particular, rule 2 promotes the secondary part-of-speech tag of singular common noun to the primary part-of-speech tag if: the token [i-2] has a primary part-of-speech tag of article, as shown by entry 270; the token [i] has a primary part-of-speech tag of either verb or second possessive pronoun or exclamation or verb past tense form, as shown by entry 272; and the token [i] has a secondary part-of-speech tag of singular common noun, as shown by entry 272. Rule 3 promotes the secondary part-of-speech tag of singular common noun to the primary part-of-speech tag if: the token [i-1]
has a part-of-speech tag of verb infinitive or singular common noun, as shown by entry 274; and the token [i] has a primary part-of-speech tag of verb or second possessive pronoun or exclamation or verb past tense form and has a secondary part-of-speech tag of singular common noun, as shown by entry 276. Rule 4 promotes the secondary part-of-speech tag of singular common noun to the primary part-of-speech tag if: the token [i-1] has a part-of-speech tag of modal auxiliary or singular common noun, as shown by entry 278; the token [i] has a primary part-of-speech tag of modal auxiliary and has a secondary part-of-speech tag of singular common noun, as shown by entry 280; and the token [i+1] has a part-of-speech tag of infinitive, as shown by entry 282.

FIG. 9 thus illustrates one embodiment of the invention wherein the disambiguator 59 of the processor 47 modifies the ambiguous target token in accordance with a rule table. In particular, the illustrated rule table instructs processor 47 to modify the part-of-speech tags of the ambiguous token as a function of: the two tokens preceding the ambiguous target token in the stream of text, the token following the ambiguous target token in the stream of text, and the ambiguous target token itself. FIG. 9 further illustrates an embodiment wherein the ambiguous target token is modified as a function of the primary part-of-speech tag and the secondary part-of-speech tags of the ambiguous target token, and the part-of-speech tags of the other tokens surrounding the target token.

Disambiguation step 252 can also provide for a system that aids in identifying the elements of a noun phrase by checking whether or not the tokens in the stream of natural language text agree in gender, number, definiteness, and case. In particular, processor 47 can validate agreement between a candidate token and a token immediately adjacent (i.e., either immediately preceding or immediately following) the candidate token in the stream of text.
Agreement analysis prior to step 253, wherein the noun phrase is identified, operates in a single match mode that returns a success immediately after the first successful match. Thus, if agreement is being tested for token [i] and token [i-1] in the single match mode, processing stops as soon as a match is found. In accordance with this process, the processor selects the first part-of-speech tag from token [i], and tries to match it with each tag for the token [i-1] until success is reached or all of the part-of-speech tags in token [i-1] are exhausted. If no match is found, then the processor 47 tries to match the next part-of-speech tag in the token [i] with each tag in token [i-1] until success is reached or all of the part-of-speech tags in token [i-1] are exhausted. This process continues until either a match is reached, or all of the part-of-speech tags in both token [i] and token [i-1] have been checked with each other. A successful agreement found between two tokens indicates that the two tokens are to be treated as part of a noun phrase. If no agreement is found, then the two tokens are not considered to be a part of the same noun phrase.

First, the first POS tag from each token is checked for agreement.
______________________________________
        Agreement Tags       Agreement Tags       Agreement Tags
______________________________________
i-1     Plural, Masculine    Singular, Masculine
i       Singular, Feminine   Singular, Masculine  Plural, Masculine
______________________________________
(Tag1 & Tag2 & NumberMap)    & (Tag1 & Tag2 & GenderMap)
fails                        fails
______________________________________
If this fails, the second POS tag from the token [i-1] is checked for a match:
______________________________________
        Agreement Tags       Agreement Tags       Agreement Tags
______________________________________
i-1     Plural, Masculine    Singular, Masculine
i       Singular, Feminine   Singular, Masculine  Plural, Masculine
______________________________________
(Tag1 & Tag2 & NumberMap)    & (Tag1 & Tag2 & GenderMap)
passes                       fails
______________________________________
At this point, all of the POS tags in the token [i-1] have been exhausted, and no successful match has been found. The second POS tag in the token [i] must now be compared with all of the POS tags in the token [i-1]. The first POS tag from the token [i-1] and the second tag from the token [i] are checked for a match:
______________________________________
        Agreement Tags       Agreement Tags       Agreement Tags
______________________________________
i-1     Plural, Masculine    Singular, Masculine
i       Singular, Feminine   Singular, Masculine  Plural, Masculine
______________________________________
(Tag1 & Tag2 & NumberMap)    & (Tag1 & Tag2 & GenderMap)
fails                        passes
______________________________________
If this fails, the second POS tag from the token [i-1] is checked for agreement:
______________________________________
        Agreement Tags       Agreement Tags       Agreement Tags
______________________________________
i-1     Plural, Masculine    Singular, Masculine
i       Singular, Feminine   Singular, Masculine  Plural, Masculine
______________________________________
(Tag1 & Tag2 & NumberMap)    & (Tag1 & Tag2 & GenderMap)
passes                       passes
______________________________________
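The four comparisons traced above can be sketched in Python as follows. This is only an illustrative reading of the single match mode, not the implementation of the invention: the bit layout of the agreement tags, the names NUMBER_MAP and GENDER_MAP, and the function names are assumptions chosen to mirror the test (Tag1 & Tag2 & NumberMap) & (Tag1 & Tag2 & GenderMap), with the outer "&" read as a logical AND.

```python
# Hypothetical bit layout for agreement tags (an assumption, not from the source).
SINGULAR, PLURAL = 0b0001, 0b0010
MASCULINE, FEMININE = 0b0100, 0b1000
NUMBER_MAP = SINGULAR | PLURAL      # mask selecting the number bits
GENDER_MAP = MASCULINE | FEMININE   # mask selecting the gender bits


def tags_agree(tag1, tag2):
    """Two tags agree when they share a number bit AND share a gender bit."""
    common = tag1 & tag2
    return bool(common & NUMBER_MAP) and bool(common & GENDER_MAP)


def single_match(tags_i, tags_prev):
    """Try each tag of token [i] against each tag of token [i-1].

    Returns (matching pair or None, number of comparisons made) and stops
    at the first successful match.
    """
    comparisons = 0
    for tag_i in tags_i:
        for tag_prev in tags_prev:
            comparisons += 1
            if tags_agree(tag_i, tag_prev):
                return (tag_i, tag_prev), comparisons
    return None, comparisons


# Tags of the two tokens from the tables above.
prev_tags = [PLURAL | MASCULINE, SINGULAR | MASCULINE]
cur_tags = [SINGULAR | FEMININE, SINGULAR | MASCULINE, PLURAL | MASCULINE]
match, count = single_match(cur_tags, prev_tags)
```

With these inputs the search succeeds on the fourth comparison, pairing the second tag of each token, and the third tag of token [i] is never examined.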
At this point, a match has successfully been made, and all agreement processing stops. The two tokens agree and Single Match mode processing is complete.
After Step 252, logical flow proceeds to Step 253. At step 253, the noun-phrase identifier module 53 of processor 47 identifies the boundaries of noun phrases contained within the stream of natural language text, and marks those tokens forming the noun phrase. In particular, processor 47 identifies the noun-phrase boundaries through contextual analysis of each extracted token in the stream of text. In addition, module 53 marks those tokens forming the noun phrase by tagging tokens contained within the noun phrase. For example, module 53 can associate with: the first token in the noun phrase a tag indicating "the beginning" of the noun phrase; the last token in the noun phrase a tag indicating "the end" of the noun phrase; and those tokens found between the first and last tokens in the noun phrase a tag indicating "the middle" of the noun phrase. Thus, module 53 of processor 47 identifies those tokens that it determines are members of a noun phrase as either "the beginning", "the middle", or "the end" of the noun phrase.
According to one aspect of the invention, the noun-phrase identifier module 53 of processor 47 forms a window of sequential tokens to aid in identifying members of a noun phrase. Further in accordance with this aspect, the window of sequential tokens includes a token currently undergoing analysis, called a candidate token, and tokens preceding and following the candidate token in the stream of text. Preferably, the window of tokens includes the candidate token and one token immediately following the candidate token in the stream of text and one token immediately preceding the candidate token in the stream of text. Thus, the window contains at least three extracted tokens ranging from the token preceding the candidate token to the token following the candidate token inclusive.
This window of sequential tokens provides a basis for contextually analyzing the candidate token to determine whether or not it is a member of a noun phrase. The module 53 analyzes characteristics of the window of sequential tokens to determine whether the candidate token is a member of a noun phrase. The characteristics analyzed by processor 47 include, either separately or in conjunction, the part-of-speech tags and the grammatical features of each of the tokens contained within the window of tokens. Module 53 of processor 47 contextually analyzes the candidate token by applying a set of rules or functions to the window of sequential tokens surrounding the candidate token, and the respective characteristics of the window of sequential tokens. By applying these rules, module 53 identifies those candidate tokens which are members of noun phrases contained within the stream of text.
The noun-phrase identification rules are a set of hard-coded rules that define the conditions required to start, continue, and terminate a noun phrase. In general, noun phrases are formed by concatenating two or more contiguous tokens having parts-of-speech functionally related to nouns. Those parts-of-speech functionally related to nouns include the following parts-of-speech: singular common noun (NN), adjective (JJ), ordinal number (ON), cardinal number (CD). In one embodiment, the noun-phrase rules apply these concepts and form noun phrases from those sequential tokens having parts-of-speech functionally related to nouns. Thus, for example, a set of four rules in pseudocode for identifying noun phrases is set forth in Table I below.
TABLE I
______________________________________
1 If the token is a member of Noun Phrase Tags
2 start to form a Noun Phrase.
3 If the token is a stop list noun or adjective
4 If the Noun-phrase length is 0
5 don't start the Noun Phrase
6 else
7 break the Noun Phrase.
8 If the token is a lowercase noun AND
9 the following token is an uppercase noun
10 break the Noun Phrase.
11 If the token is a member of Noun-phrase Tags
12 continue the Noun Phrase.
______________________________________
In Table I, lines 1-2 represent a first rule and provide for identifying as a "beginning of a noun phrase" those candidate tokens having a part-of-speech tag functionally related to noun word forms. That is, the first rule tags as the beginning of a noun phrase those tokens having a part-of-speech tag selected from the group of part-of-speech tags, including: singular common noun, adjective, ordinal number, cardinal number.
Lines 3-7, in Table I, represent a second rule. The second rule provides for identifying as an "end of the noun phrase" those candidate tokens having a part-of-speech tag selected from the group consisting of stoplist nouns and adjectives. The default implementation of the second rule contains two stoplist nouns (i.e., one and ones) and one stoplist adjective (i.e., such). In particular applications, however, the user may introduce user-defined stoplist nouns and adjectives. For example, a user may choose to treat semantically vague generic nouns such as use and type as stoplist nouns.
In addition, lines 8-10 represent a third rule. This third rule specifies that module 53 of processor 47 is to identify as an "end of the noun phrase" those selected tokens having a part-of-speech tag of noun and having a Capcode Field identification of "000" (i.e., lowercase), when the selected token is followed by an extracted token having a part-of-speech tag of noun and having a Capcode Field identification of "001" (i.e., initial uppercase) or "010" (i.e., all uppercase). Thus, in general, the third rule demonstrates identifying the end of a noun phrase through analysis of a group of tokens surrounding a candidate token, and through analysis of the part-of-speech tags and grammatical features of tokens in the window of sequential tokens.
The fourth rule, represented by lines 11-12 in Table I, provides for identifying as a "middle of the noun phrase" those selected tokens having a part-of-speech tag functionally related to noun word forms and following an extracted token identified as part of the noun phrase. For example, a token having a part-of-speech tag functionally related to noun word forms and following a token that has been identified as the beginning of the noun phrase is identified as a token contained within the middle of the noun phrase.
In operation, module 53 in conjunction with processor 47 applies each rule in Table I to each token extracted from the stream of natural language text. These rules allow module 53 to identify those tokens which are members of a noun phrase, and the relative position of each token in the noun phrase. The rules illustrated in Table I are not language-specific. However, other tables exist which contain language-specific rules for identifying noun phrases. Tables II-VI, as set forth below, contain language-specific rules.
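The four rules of Table I can be sketched in Python as shown here. This is a minimal sketch under stated assumptions rather than the implementation of module 53: the Token class, the use of the token's letter case in place of the Capcode Field, the decision to leave a stoplist token out of the phrase it breaks, and the part-of-speech tags assigned to the sample sentence are all assumptions made for illustration. Phrases shorter than two tokens are discarded, following the statement that noun phrases are formed from two or more contiguous tokens.

```python
from dataclasses import dataclass

NOUN_PHRASE_TAGS = {"NN", "JJ", "ON", "CD"}   # tags functionally related to nouns
STOPLIST = {"one", "ones", "such"}            # default stoplist nouns and adjective


@dataclass
class Token:
    text: str
    pos: str


def is_lower_noun(tok):
    return tok.pos == "NN" and tok.text.islower()


def is_upper_noun(tok):
    # Stand-in for Capcode "001" or "010": the first character is uppercase.
    return tok.pos == "NN" and tok.text[:1].isupper()


def find_noun_phrases(tokens):
    """Return the noun phrases (lists of words) found by the Table I rules."""
    phrases, current = [], []

    def close():
        # Keep only phrases of two or more tokens, then reset the buffer.
        if len(current) >= 2:
            phrases.append([t.text for t in current])
        current.clear()

    for i, tok in enumerate(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        # Rule 2 (stoplist) and any token outside the Noun Phrase Tags
        # both end whatever phrase is in progress (assumed: token excluded).
        if tok.pos not in NOUN_PHRASE_TAGS or tok.text.lower() in STOPLIST:
            close()
            continue
        current.append(tok)           # Rules 1 and 4: start or continue
        # Rule 3: a lowercase noun followed by an uppercase noun ends the phrase.
        if is_lower_noun(tok) and following is not None and is_upper_noun(following):
            close()
    close()
    return phrases


# Hypothetical tagging of the FIG. 12 sample text.
sample = [Token("The", "DT"), Token("cash", "NN"), Token("flow", "NN"),
          Token("is", "VBZ"), Token("strong", "JJ"), Token(",", ","),
          Token("the", "DT"), Token("dividend", "NN"), Token("yield", "NN"),
          Token("is", "VBZ"), Token("high", "JJ"), Token(",", ","),
          Token("and", "CC")]
found = find_noun_phrases(sample)
```

In each returned phrase the first word corresponds to "the beginning", the last to "the end", and any words between to "the middle" of the noun phrase.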
TABLE II
______________________________________
English Language Noun-Phrase Rules
______________________________________
1 If the token is uppercase AND
2 the token has a Part-of-speech Tag of Singular Adverbial Noun AND
3 the preceding token is a noun
4 break the Noun Phrase
5 If the token is an adjective AND
6 the preceding token is a non-possessive noun
7 break the Noun Phrase
8 If the token is "of" or "&" AND
9 the preceding token is an uppercase noun AND
10 the following token is an uppercase noun
11 form a Noun Phrase starting with the preceding token and
12 continue the Noun Phrase as long as Noun Phrase Tags are
13 encountered.
______________________________________
Table II contains a group of rules, in pseudocode, specific to the English language. For example, lines 1-4 specify a first rule for identifying the end of a noun phrase, lines 5-7 recite a second rule for identifying the end of a noun phrase, and lines 8-13 specify a third rule for identifying the beginning and for identifying the middle of a noun phrase.
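As an illustration, the third English rule (lines 8-13 of Table II) might be sketched in Python as follows. The token representation as (text, tag) pairs, the helper names, and the sample sentence with its tags are assumptions made for this sketch; uppercase is approximated by testing the first character of the token.

```python
NOUN_PHRASE_TAGS = {"NN", "JJ", "ON", "CD"}
CONNECTORS = {"of", "&"}


def is_upper_noun(token):
    text, pos = token
    return pos == "NN" and text[:1].isupper()


def connector_phrases(tokens):
    """Find phrases of the form <Uppercase noun> of/& <Uppercase noun> ..."""
    phrases = []
    i = 1
    while i < len(tokens) - 1:
        if (tokens[i][0] in CONNECTORS
                and is_upper_noun(tokens[i - 1])
                and is_upper_noun(tokens[i + 1])):
            # Start with the preceding token, then continue the phrase
            # for as long as Noun Phrase Tags are encountered.
            phrase = [tokens[i - 1][0], tokens[i][0]]
            j = i + 1
            while j < len(tokens) and tokens[j][1] in NOUN_PHRASE_TAGS:
                phrase.append(tokens[j][0])
                j += 1
            phrases.append(phrase)
            i = j
        else:
            i += 1
    return phrases


# Hypothetical tagged sentence.
sentence = [("Bank", "NN"), ("of", "IN"), ("America", "NN"),
            ("Corporation", "NN"), ("said", "VBD")]
result = connector_phrases(sentence)
```

Here the connector "of" sits between two uppercase nouns, so the phrase starts at "Bank" and runs through "Corporation".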
TABLE III
______________________________________
German Language Noun-Phrase Rules
______________________________________
1 If the token is an adjective AND
2 the preceding token is a noun AND
3 the following token is a member of Noun Phrase Tags
4 break the Noun Phrase
______________________________________
Table III contains a group of rules, in pseudocode, specific to the German Language. For example, lines 1-4 specify a rule for identifying the end of a noun phrase.
TABLE IV
______________________________________
Italian Language Noun-Phrase Rules
______________________________________
1 If the token is "di" AND
2 the preceding token is a noun AND
3 the following token is a lowercase noun
4 form a Noun Phrase starting with the preceding token and
5 continue the Noun Phrase as long as Noun Phrase Tags are
6 encountered.
______________________________________
Table IV contains a group of rules, in pseudocode, specific to the Italian Language. For example, lines 1-6 specify a rule for identifying the beginning and the middle of a noun phrase.
TABLE V
______________________________________
French and Spanish Noun Phrase Rules
______________________________________
1 If the token is "de" AND
2 the preceding token is a noun AND
3 the following token is a lowercase noun
4 form a Noun Phrase starting with the preceding token and
5 continue the Noun Phrase as long as Noun Phrase Tags are encountered.
______________________________________
Table V contains a group of rules, in pseudocode, specific to the French and Spanish Languages. For example, lines 1-5 recite a rule for identifying the beginning and the middle of a noun phrase.
TABLE VI
______________________________________
French and Spanish and Italian Noun-Phrase Rules
______________________________________
1 If the token is an adjective AND
2 the preceding token is a noun AND
3 the following token is a noun
4 break the Noun Phrase
______________________________________
Table VI contains a group of rules, in pseudocode, specific to the French and Spanish and Italian languages. For example, lines 1-4 recite a rule for identifying the end of a noun phrase. After action box 253 of FIG. 8, control proceeds to decision box 254 of FIG. 8. At decision box 254 the processor 47 identifies whether the user requested application of the agreement rules to the noun phrase identified in action box 253. If the user did not request application of the agreement rules, control branches to decision box 256. If the user did request application of the agreement rules, logical control proceeds to action box 255 wherein the agreement rules are applied. At action box 255 the agreement checking module 57 of the processor 47 ensures that the tokens within the identified noun phrase are in agreement. Although English has no agreement rules, other languages such as German, French and Spanish require agreement between the words contained within a noun phrase. For example, French and Spanish require gender and number agreement within the noun phrase, while German requires gender, number, and case agreement within the noun phrase. The grammatical features concerning gender, number, and case agreement are supplied by the grammatical feature fields of the word data table. FIG. 10 illustrates a pseudocode listing that processor 47 executes to ensure agreement between the various members contained within an identified noun phrase. In particular, processor 47 iteratively checks whether a first identified part of a noun phrase agrees with a second identified part of the noun phrase that immediately follows the first identified part in the stream of text. As described below, processor 47 ensures that each particular extracted token within the noun phrase agrees with all other extracted tokens contained in the noun phrase. 
Pictorially, given a series of tokens with their associated agreement tags as shown below, where all tokens shown are valid candidates for being in the noun phrase, it would be possible to form a noun phrase that started with the token [i-2] and continued to the token [i+1] because they all agree with respect to the agreement tags of "Singular, Feminine".
______________________________________
        Agreement Tags       Agreement Tags       Agreement Tags
______________________________________
i-2     Plural, Masculine    Singular, Masculine  Singular, Feminine
i-1     Plural, Masculine    Singular, Feminine   Plural, Feminine
i       Singular, Feminine   Singular, Masculine  Plural, Masculine
i+1     Singular, Feminine
______________________________________
In one embodiment for checking agreement, two temporary array areas, temp1 and temp2, are proposed for storing the tokens while agreement is iteratively checked between the identified parts of the noun phrase. The token [i-2], identified as the "beginning of the noun phrase", has all of its agreement tags copied to a temporary area, temp1.
______________________________________
temp1   Plural, Masculine    Singular, Masculine  Singular, Feminine
temp2
______________________________________
All agreement tags for the next token, token [i-1], whose values agree with the temp1 area are placed in a second temporary area, temp2.
______________________________________
temp1   Plural, Masculine    Singular, Masculine  Singular, Feminine
temp2   Plural, Masculine    Singular, Feminine
______________________________________
As long as there are some identified agreement tags in temp1 and temp2, agreement has passed and the noun phrase can continue to be checked. If there is no match, agreement fails and the noun phrase is broken. When the noun phrase is broken, the last token that agrees with the previous tokens in the noun phrase is re-identified as the "end of the noun phrase". In the current case being examined, there was agreement between temp1 and temp2, so the contents of temp2 are copied to temp1, and the next token is retrieved.
______________________________________
temp1   Plural, Masculine    Singular, Feminine
temp2
______________________________________
All agreement tags for the next token [i] whose values agree with temp1 are placed in the second temporary area, temp2. When this is done, the temporary areas contain:
______________________________________
temp1   Plural, Masculine    Singular, Feminine
temp2   Singular, Feminine   Plural, Masculine
______________________________________
Because token [i-2], token [i-1], and token [i] all have the above listed agreement tags in common, the contents of the temp2 area are copied to temp1, and the next token is retrieved.
______________________________________
temp1   Singular, Feminine   Plural, Masculine
temp2
______________________________________
All agreement tags for the next token [i+1] whose values agree with temp1 are placed in the second temporary area, temp2. When this is done, the temporary areas contain:
______________________________________
temp1   Singular, Feminine   Plural, Masculine
temp2   Singular, Feminine
______________________________________
Because the token [i-2], token [i-1], token [i], and token [i+1] all have these agreement tags in common, the contents of the temp2 area are copied to temp1, and the next token is retrieved.
______________________________________
temp1   Singular, Feminine
temp2
______________________________________
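The temp1/temp2 narrowing traced in the tables above can be sketched in Python as follows. This is a simplified illustration, not the FIG. 10 listing: agreement tags are modeled as (number, gender) pairs, two tags are taken to agree only when both values are identical, and the function name and return shape are assumptions.

```python
def check_phrase_agreement(token_tags):
    """Narrow the shared agreement tags token by token.

    token_tags is a list with one set of (number, gender) tags per token.
    Returns (number of leading tokens that agree, tags they share).
    """
    if not token_tags:
        return 0, set()
    temp1 = set(token_tags[0])            # tags of the beginning token
    for index in range(1, len(token_tags)):
        # temp2 holds the tags of the next token that agree with temp1.
        temp2 = {tag for tag in token_tags[index] if tag in temp1}
        if not temp2:
            # Agreement fails: the phrase is broken before this token.
            return index, temp1
        temp1 = temp2                     # copy temp2 to temp1 and move on
    return len(token_tags), temp1


PM, SM = ("Plural", "Masculine"), ("Singular", "Masculine")
SF, PF = ("Singular", "Feminine"), ("Plural", "Feminine")

# Tokens [i-2], [i-1], [i], [i+1] from the example above.
example = [{PM, SM, SF}, {PM, SF, PF}, {SF, SM, PM}, {SF}]
agreeing, shared = check_phrase_agreement(example)
```

When the first returned value is smaller than the number of tokens, the tokens before that index are the ones that still agree, so the last of them would be re-identified as the end of the noun phrase.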
At this point, noun phrase processing ends in our example. All the tokens from token [i-2] to token [i+1] had at least one agreement tag in common, and thus passed the agreement test.
In a further embodiment, the agreement checker 57 of the processor 47 creates a "supertag" when checking agreement in accordance with action box 255 of FIG. 8. The supertags allow the agreement module 57 to quickly identify whether the extracted tokens fail to agree, or whether they may agree. In particular, a supertag is created for each extracted word contained within the identified noun phrase by logically OR'ing together all the agreement tags associated with each identified token in the noun phrase. A supertag associated with one token in the noun phrase is then compared against the supertag associated with the following token in the noun phrase to see if any form of agreement is possible. A form of agreement is possible if the required number, gender, and case parameters agree or contain potential agreements between each of the supertags. If the required number, gender, and case parameters contained in the supertags do not agree, then agreement is not possible. By making this comparison, it can be quickly determined whether or not agreement may exist between the tokens or whether agreement is impossible.
After action box 255, logical flow proceeds to decision box 256. At decision box 256 the processor 47 identifies whether the user requested application of the truncation rules to the noun phrase identified in action box 253. If the user did not request application of the truncation rules, control branches to action box 258. If the user did request application of the truncation rules, logical control proceeds to action box 257 wherein the truncation rules are applied. At action box 257, the truncator module 61 of the processor 47 truncates the identified noun phrases. In one aspect of the invention, as illustrated by the pseudocode listing of FIG. 11, truncator 61 truncates noun phrases exceeding two words in length which satisfy a specific set of rules. In accordance with another aspect of the invention, the truncator 61 removes tokens within the noun phrase that fail to agree with the other tokens within the noun phrase. Preferably, this operation is achieved by the truncator module 61 operating in conjunction with the agreement checking module 57. For example, agreement module 57 identifies those tokens within the noun phrase that are in agreement and those tokens that are not in agreement, and truncator module 61 re-examines which tokens belong in the noun phrase based upon the agreement analysis of agreement checking module 57. Thus truncator module 61 truncates from the noun phrase the set of tokens following, and including, a token that does not agree with the preceding members of the identified noun phrase.
At action box 258, processor 47 outputs the tokens extracted from the input stream of natural language text into the output buffer 19 of the application program interface 11. Processor 47 also generates the token list 17 that correlates the input buffer of text 15 with the output buffer 19, and places the token list 17 into the application program interface. The generated token list 17 comprises an array of tokens that describe parameters of the input and output data. The parameters associated with each token include the part-of-speech tags, the grammatical features, and the noun-phrase member tags. With this data, processor 30 in digital computer 12 is able to output to display 20 the identified noun phrases contained within the input stream of natural language text.
FIG. 12 illustrates an example of the operation of the noun-phrase analyzer 13 having an input buffer 400, a token list 402, an output buffer 404, and identified noun phrases 406. In particular, input buffer 400 contains a natural language text stream reading The cash flow is strong, the dividend yield is high, and.
Token list 402 contains a list of tokens, wherein the tokens cash and dividend are identified as the "beginning of a noun phrase", and wherein the tokens flow and yield are identified as the "end of a noun phrase". Output buffer 404 contains a list of the lexical expressions found in the input buffer 400, and box 406 contains the identified noun phrases cash flow and dividend yield. FIG. 12 demonstrates the ability of the noun-phrase analyzer 10 to identify groups of words having a specific meaning when combined. Simply tokenizing the words in the stream of text and placing them in an index could result in many irrelevant retrievals.
FIG. 13 illustrates a pseudocode listing for implementing a morphological analyzer/generator 2. In particular, the morphological analyzer can contain a processor 30 implementing the pseudocode listing of FIG. 13 as stored in memory 12. Additional tables, as illustrated in FIGS. 4A-4C, necessary for the implementation of morphological analyzer/generator 2 can also be stored in memory element 12.
Lines 1 and 54 of the pseudocode listing in FIG. 13 form a first FOR-LOOP that is operational until the noun form, the verb form, and the adverb/adjective form of the candidate word are each processed. In operation, processor 30 implements the conditions within the first FOR-LOOP of lines 1 and 54 by accessing the FIG. 3 representative entry 33 associated with the candidate word. The representative entry 33 includes a noun pattern field 46, a verb pattern field 48, and an adjective/adverb pattern field 50. Each of the fields (e.g., 46, 48, and 50) identifies a particular morphological transform in FIG. 4C.
Lines 2-4 of the pseudocode listing contain steps for checking whether morphological paradigms associated with each particular grammatical field being processed (i.e. noun, verb, adjective/adverb) exist. The steps can be implemented by processor 30 accessing the FIG. 3 representative entry of the candidate word and identifying whether the fields 46, 48, 50 identify a valid morphological paradigm.
Lines 5-9 of the pseudocode of FIG. 13 include a logical IF-THEN-ELSE construct for determining the morphological paradigms associated with the candidate word. In particular, these steps form a variable called "LIST" that identifies the locations of paradigms. "LIST" can include one location in column 73 of FIG. 4C, or "LIST" can include a portmanteau rule identifying a plurality of locations in column 73.
Lines 10 and 53 of the pseudocode listing form a second FOR-LOOP nested within the first FOR-LOOP of lines 1 and 54. The second FOR-LOOP of lines 10 and 53 provides a logical construct for processing each of the paradigms contained in "LIST". Lines 11 and 52 form a third nested FOR-LOOP that processes each candidate word once for each part-of-speech tag of the candidate word (identified as "POS tag" in FIG. 13). The part-of-speech tags of the candidate word (i.e. "POS tag") are identified by the POS Combination Index Field 44 of FIG. 3 that is associated with the candidate word.
In one aspect of the invention, lines 12-18 include steps for identifying morphological transforms of the candidate word given a part-of-speech tag for the candidate word and given a morphological paradigm for the candidate word. For example, the pseudocode instructions determine whether the baseform part-of-speech tag of the morphological transform (identified as "BASE POS" in FIG. 13) matches the part-of-speech tag of the candidate word. If a match is found, then the morphological transform is marked as a possible morphological transform for the candidate word, and the candidate word can be identified as a baseform. Lines 27 and 51 of FIG. 13, in accordance with another aspect of the invention, contain a further nested FOR-LOOP.
This FOR-LOOP operates upon each of the morphological transforms listed in the particular paradigm from "LIST" that is currently being processed. Further in accordance with the invention, each morphological transform within the current paradigm being processed is inspected to determine whether the morphological transform is an appropriate morphological transform for the candidate word. In particular, as illustrated by pseudocode lines 28-31, processor 30 identifies an appropriate morphological transform based upon whether a parameter of the candidate word matches a morphological pattern contained within a selected morphological transform. For instance, line 28 of the pseudocode determines whether the part-of-speech tag of the candidate word matches the part-of-speech tag of the morphological transform. If a match exists, the morphological transform is identified as an applicable transform for the candidate word.
In accordance with another embodiment of the invention, as shown in pseudocode lines 28-29 of FIG. 13, the processor 30 can identify an appropriate morphological transform based upon various parameters of the candidate word matching various morphological patterns contained within a selected morphological transform. The parameters of the candidate word can include: information contained within the representative entry 33, of FIG. 3; the length of the candidate word; and the identity of the character strings forming the candidate word, i.e. the suffixes, prefixes, and infixes in the candidate word. The morphological patterns of a selected morphological transform, in turn, are generally selected from the functional elements contained in the morphological transform.
Thus, the morphological patterns can be selected from: a functional element defining the part-of-speech tag of the baseform; a functional element defining the character string to strip from a candidate word; a functional element defining the character string to add to a candidate word; and a functional element defining the part-of-speech tag of the morphologically transformed candidate word. For example, the processor 30 can compare the suffix of a candidate word with the second functional element of the selected morphological transform, wherein the second functional element generally denotes the suffix to strip from the candidate word to form an intermediate baseform. In an alternative embodiment, the processor 30 can compare the prefix of the candidate word with the second functional element of the selected morphological transform. In another embodiment, the processor 30 compares the infix of the candidate word with the second functional element of the selected morphological transform. Following the comparison step, processor 30 then identifies those morphological transforms having morphological patterns matching the selected parameter of the candidate word as appropriate transforms for the candidate word.
Preferably, as illustrated in lines 28-31 of the FIG. 13 pseudocode listing, the processor 30 only applies those transforms that both: (1) have a part-of-speech tag matching the part-of-speech tag of the candidate word; and (2) have a first character string to be removed from the candidate word that matches either a suffix, prefix, or infix in the candidate word. According to a further embodiment of the invention, prefixation and infixation can be handled by separate structural elements in the system, as illustrated by pseudocode lines 32-35 of FIG. 13. Lines 32-35 illustrate a separate modular element for determining an applicable transform based on prefixation.
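The two-part test described above (matching part-of-speech tag, then matching character string to remove) can be sketched in Python for the suffix case. This is a sketch under stated assumptions, not the FIG. 13 listing: the Transform class and function names are hypothetical, the field order follows the functional elements listed for FIG. 4C, the two sample transforms are the ones quoted for inflection transform 001, and the word bake is chosen only for illustration. Running a transform in reverse to recover a baseform is one possible reading of how the same table serves both inflection and uninflection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transform:
    base_pos: str   # part-of-speech tag of the baseform
    strip: str      # suffix removed from the baseform when inflecting
    add: str        # suffix added to form the inflected word
    pos: str        # part-of-speech tag of the inflected word


def swap_suffix(word, remove, add):
    """Replace the suffix `remove` with `add`, or return None if absent."""
    if not word.endswith(remove):
        return None
    return word[:len(word) - len(remove)] + add


def inflect(word, word_pos, transform):
    """Apply the transform only if POS and the suffix to strip both match."""
    if word_pos != transform.base_pos:
        return None
    return swap_suffix(word, transform.strip, transform.add)


def uninflect(word, word_pos, transform):
    """Run the transform in reverse to recover a baseform."""
    if word_pos != transform.pos:
        return None
    return swap_suffix(word, transform.add, transform.strip)


past_participle = Transform("VB", "", "d", "VBN")       # VB _ -> d _ VBN
present_participle = Transform("VB", "e", "ing", "VBG")  # VB _ e -> ing _ VBG

forms = (inflect("bake", "VB", past_participle),
         inflect("bake", "VB", present_participle),
         uninflect("baking", "VBG", present_participle))
```

A transform whose part-of-speech tag or suffix does not match the candidate word simply yields None, which corresponds to the transform being skipped.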
Lines 32-35 first identify whether the current morphological transform has the prefix flag set, as described in the discussion of FIG. 4C. If the prefix flag is set, a separate morphological prefix table containing morphological changes applicable to prefixes is referenced. The prefix table can be identified through the representative word entry 33 for the candidate word. The prefix table will provide a list of baseform and inflection prefix pairs. To handle prefixation, the processor 30 will locate the longest matching prefix from one column in the prefix table, remove it, and replace it with the prefix from the other column. Preferably, these modifications will only be done when a morphological transform is tagged as requiring a prefix change. An analogous system can be created to address infixation.
Prefixation and infixation morphology are particularly applicable in Germanic languages, such as German and Dutch. In these languages the morphology of the word can change based upon the alteration of a character string in the beginning, middle, or end of the word. For example, German verbs display significant alternations in the middle and end of words: the verb einbringen (ein+bringen) forms its past participle as ein+ge+bracht, with the infixation (insertion) of the string ge between the verbal prefix and stem; and the transformation of the stem bringen into bracht.
The morphological analyzer/generator 2 illustrated in FIG. 13 provides a system capable of morphologically transforming words found within natural language text. For example, the multilingual text processor 10 of FIG. 1 can extract the candidate word drinks from a stream of text and forward the candidate word to analyzer/generator 2 through interface 11. The text processor 10 can further identify a representative entry 33 for the candidate word.
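Returning to the prefix table described earlier in this passage, the longest-match prefix replacement might look like the following Python sketch. The table contents, the function name, and the to_baseform flag are hypothetical; the source states only that the table holds baseform and inflection prefix pairs and that the longest matching prefix in one column is replaced by the prefix in the other column. Stem changes such as bringen to bracht are outside the scope of this sketch.

```python
# Hypothetical prefix table: (baseform prefix, inflection prefix) pairs.
PREFIX_PAIRS = [("", "ge"), ("ein", "einge")]


def swap_prefix(word, prefix_pairs, to_baseform=True):
    """Replace the longest matching prefix with its partner from the table.

    With to_baseform=True the inflection column is matched and replaced by
    the baseform column; otherwise the direction is reversed. The word is
    returned unchanged when no prefix in the source column matches.
    """
    src, dst = (1, 0) if to_baseform else (0, 1)
    candidates = [pair for pair in prefix_pairs if word.startswith(pair[src])]
    if not candidates:
        return word
    best = max(candidates, key=lambda pair: len(pair[src]))
    return best[dst] + word[len(best[src]):]


stripped = swap_prefix("eingebracht", PREFIX_PAIRS)              # remove "einge", add "ein"
restored = swap_prefix("einbracht", PREFIX_PAIRS, to_baseform=False)
```

Choosing the longest match matters in the reverse direction: both the empty prefix and "ein" match "einbracht", and only the longer one gives the intended result.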
Once a representative entry is located, the text processor 10 can provide information concerning the word drinks, such as the parts-of-speech and inflectional paradigms. In particular, the text processor 10 determines the parts-of-speech of drinks to be noun plural and verb 3rd singular present; and the text processor determines the locations of a noun inflectional paradigm, a verb inflectional paradigm, an adjective/adverb paradigm, and a derivational paradigm. After the text processor 10 obtains the data related to the candidate word drinks, the text processor can generate the appropriate morphological transforms in accordance with the pseudocode listing of FIG. 13. The morphological analyzer/generator 2 first addresses the noun inflectional paradigm, and determines that the noun paradigm has only one paradigm. Analyzer/generator 2 then processes the candidate word by applying the inflectional transforms contained within the identified noun paradigm to each part-of-speech of the candidate word drinks. The inflectional transforms within the noun paradigm are applied by first determining which inflectional transforms should be applied, and by then applying those inflectional transforms to generate inflectional baseforms. For instance, the candidate word contains a part-of-speech of noun plural which must first be matched with particular inflectional transforms contained within the noun paradigm. The matching can be accomplished, in one embodiment, by comparing the parts-of-speech associated with a particular transform to the part-of-speech of the candidate words. Thus, analyzer/generator 2 compares the current part-of-speech of the candidate word, i.e., noun plural, to the part-of-speech tags associated with the inflectional transforms stored in the noun inflectional paradigm. 
The analyzer determines: (1) the baseform part-of-speech of the noun paradigm is noun singular, which does not match the part-of-speech tag of the candidate word; (2) the first inflectional transform has an associated part-of-speech tag of noun singular possessive, which does not match the part-of-speech tag of the candidate word; and (3) the second inflectional transform has an associated part-of-speech tag of noun plural, which does match the part-of-speech tag of the candidate word. These comparison steps indicate that only the second inflectional transform matches the noun plural part-of-speech of the candidate word, and that therefore only the second inflectional transform contained within the noun paradigm is applied.
Analyzer/generator 2 then continues to process the candidate word by applying the inflectional transforms contained within the identified verb paradigm and the identified adjective/adverb paradigm. The verb paradigm contains one paradigm having a baseform and two inflectional transforms, while the candidate word is associated with a potentially matching part-of-speech tag of verb 3rd singular present. The baseform part-of-speech tag of the verb inflectional paradigm is verb infinitive, which does not match the part-of-speech tag of the candidate word. The part-of-speech tag of the first inflectional transform is verb present participle, which does not match the part-of-speech tag of the candidate word. However, the part-of-speech tag of the second inflectional transform is verb 3rd singular present, which does match the part-of-speech tag of the candidate word. Thus, the inflectional transform contained within the second rule of the verb inflectional paradigm is applied to the candidate word. After the application of the noun paradigm and the verb paradigm, the analyzer/generator 2 processes the transforms contained within the adjective/adverb paradigm. 
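The tag comparison walked through above for the candidate word drinks can be sketched as follows. This is only an illustrative sketch and not the pseudocode of FIG. 13: the `Paradigm` structure, the function name `matching_transforms`, and the use of plain descriptive strings as part-of-speech tags are assumptions made for the example.

```python
from typing import List, NamedTuple, Set

class Paradigm(NamedTuple):
    baseform_tag: str          # part-of-speech tag of the paradigm's baseform
    transform_tags: List[str]  # tag of each inflectional transform, in order

# Hypothetical paradigm contents mirroring the walk-through for "drinks".
NOUN = Paradigm("noun singular",
                ["noun singular possessive", "noun plural"])
VERB = Paradigm("verb infinitive",
                ["verb present participle", "verb 3rd singular present"])

# Parts-of-speech reported for the candidate word "drinks".
DRINKS_TAGS = {"noun plural", "verb 3rd singular present"}

def matching_transforms(paradigm: Paradigm, candidate_tags: Set[str]) -> List[int]:
    """Return the 1-based positions of the transforms whose tag
    matches one of the candidate word's part-of-speech tags."""
    return [position
            for position, tag in enumerate(paradigm.transform_tags, start=1)
            if tag in candidate_tags]
```

Under these assumed paradigm contents, `matching_transforms(NOUN, DRINKS_TAGS)` and `matching_transforms(VERB, DRINKS_TAGS)` each select only the second transform, in agreement with the comparison steps described in the text.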
In this particular case, the adjective/adverb paradigm is blank, thereby completing the inflectional transformation of the candidate word drinks.
FIG. 14 depicts a processing sequence for the uninflection module 5 for generating inflectional baseforms that begins at step 300. At step 302 the candidate word for the inflectional analysis is obtained. Preferably, the candidate word is obtained from a stream of natural language text by tokenizer 43 as described in connection with FIG. 6. After step 302, logical flow proceeds to step 304. At step 304 the processor 30 obtains data relevant to the candidate word. This data is obtained by first finding a substantially equivalent expression to the candidate word in the word data table 31. The substantially equivalent expression in the word data table 31 is then accessed to obtain an associated representative entry 33. A representative entry 33 contains data such as the part-of-speech combination index, the noun inflection paradigms, the verb inflection paradigms, and the adjective/adverb inflection paradigms. The data obtained from representative entry 33 can also identify portmanteau paradigms that act as branching points to multiple other paradigms.
At action box 310, the flow chart indicates the beginning of the analysis of each paradigm. At steps 312 and 314 the system determines whether the part-of-speech of the candidate word is in the same class as the current paradigm. For example, the processor determines whether the part-of-speech of the candidate word is the same as the part-of-speech of the paradigm identified by either the noun field 46, the verb field 48, or the adjective/adverb field 50 in the representative entry 33. If the part-of-speech of the candidate word is not in the same class as the current paradigm, logical flow branches back to action block 312. If the part-of-speech tag of the candidate word agrees with the current paradigm, then logical flow proceeds to decision box 316. 
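The class check of steps 312 and 314 can be illustrated with a short sketch. The dictionary standing in for representative entry 33, the paradigm identifiers, and the helper names `tag_class` and `paradigms_to_analyze` are hypothetical and are chosen only for this example; the actual layout of fields 46, 48, and 50 is not reproduced here.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a representative entry 33: one paradigm
# identifier per class field, with None marking a blank field.
ENTRY_DRINKS: Dict[str, Optional[str]] = {
    "noun": "noun-paradigm",
    "verb": "verb-paradigm",
    "adjective/adverb": None,
}

def tag_class(tag: str) -> str:
    """Map a part-of-speech tag such as 'noun plural' to its paradigm class."""
    head = tag.split()[0]
    return "adjective/adverb" if head in ("adjective", "adverb") else head

def paradigms_to_analyze(entry: Dict[str, Optional[str]],
                         candidate_tags: List[str]) -> List[str]:
    """Return the paradigms whose class matches some tag of the candidate word."""
    classes = {tag_class(tag) for tag in candidate_tags}
    selected = []
    for cls, paradigm_id in entry.items():
        if paradigm_id is None or cls not in classes:
            continue  # blank field or class mismatch: go on to the next paradigm
        selected.append(paradigm_id)
    return selected
```

For the word drinks, with the assumed tags noun plural and verb 3rd singular present, the sketch keeps the noun and verb paradigms and skips the blank adjective/adverb field.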
Decision box 316 illustrates one preferred embodiment of the invention, wherein the candidate word is compared to the paradigm's baseform. If the candidate word matches the paradigm baseform, logical flow proceeds to decision box 328. That is, if the candidate word matches the subparadigm's baseform, no uninflection is necessary. In many situations, however, the candidate word will not match the paradigm baseform. When the candidate word differs from the paradigm baseform, logical flow proceeds to action box 318. Action box 318 begins another logical FOR-LOOP wherein each inflectional transform is processed. In accordance with FIG. 14, logical flow proceeds from box 318 to decision box 320.
At decision box 320 two aspects of the invention and a preferred embodiment are illustrated. In particular, decision box 320 indicates that the part-of-speech tag of the candidate word can be compared with the fourth functional element of the inflectional transform (i.e., the functional element specifying the part-of-speech of the transform). If the part-of-speech tags match, then logical flow proceeds to action box 322. However, if the part-of-speech tags differ, logical flow branches back to box 318. According to a further aspect of the invention, as illustrated in decision box 320, the ending character strings of the candidate word and the second functional element of the inflectional transform (i.e., the functional element specifying the suffix to strip from the candidate word) are compared. If the character strings do not match, logical flow proceeds back to action box 318, while if the character strings match, logical flow proceeds to action box 322. Preferably, as illustrated in FIG. 14, the uninflection module 5 compares the part-of-speech tags associated with the inflectional transform and the candidate word, and the uninflection module 5 also compares the character strings associated with the inflectional transform and the candidate word. 
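The two comparisons made at box 320, followed by the strip-and-add operation that the transform format calls for, can be sketched as below. The `Transform` tuple follows the first four functional elements described in connection with FIG. 4C, but the field names, the function `uninflect`, and the sample noun-plural transform are assumptions introduced for illustration only.

```python
from typing import NamedTuple, Optional

class Transform(NamedTuple):
    base_tag: str  # first element: part-of-speech tag of the baseform
    strip: str     # second element: suffix to strip from the candidate word
    add: str       # third element: suffix to add to the intermediate baseform
    tag: str       # fourth element: part-of-speech tag of the transform

# Hypothetical transform relating a plural noun to its singular baseform.
NOUN_PLURAL = Transform("noun singular", "s", "", "noun plural")

def uninflect(word: str, word_tag: str, transform: Transform) -> Optional[str]:
    """Return the baseform when both checks pass, otherwise None."""
    if word_tag != transform.tag:
        return None  # part-of-speech tags differ: try the next transform
    if not word.endswith(transform.strip):
        return None  # ending character strings differ: try the next transform
    # Slice by length so that an empty strip string leaves the word intact.
    intermediate = word[:len(word) - len(transform.strip)]
    return intermediate + transform.add
```

With the assumed transform, `uninflect("drinks", "noun plural", NOUN_PLURAL)` yields drink, while a mismatched tag or a mismatched ending yields `None`, which corresponds to flow returning to box 318.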
According to this preferred embodiment, only if the part-of-speech tags match and the character strings match does logical flow proceed to action box 322.
At step 322, uninflection module 5 implements a strip and add algorithm to form the inflectional baseform of the candidate word. The strip and add algorithm is obtained from the inflectional transform currently being processed. The transform currently being processed indicates a particular character string to be removed from the candidate word and a subsequent character string to be added to the candidate word to form the inflectional baseform. After step 322, logical flow proceeds to decision box 324.
Decision box 324 is an optional step involving prefixation. If prefixation operations are requested by the user, boxes 324 and 326 will be activated. At decision box 324 the processor 30 identifies whether the inflectional transform currently being considered has a prefixation rule associated with it. If the transform does contain the prefixation rule, logical flow proceeds to action box 326; otherwise logical flow proceeds to decision box 328. At action box 326 the prefix is removed from the baseform in accordance with the inflectional transform. Logical flow then proceeds to box 328.
Steps 328, 330, 332, and 334 are optional steps demonstrating one implementation of the coupling between the inflection module 4, the uninflection module 5, the derivation expansion module 6, and the underivation (derivation reduction) module 7. In particular, decision box 328 identifies whether the user has requested underivation (derivation reduction). If underivation (derivation reduction) has been requested, logical flow proceeds to action box 330; otherwise flow proceeds to decision box 332. At action box 330 the candidate word undergoes underivation (derivation reduction) in accordance with the flowchar
