|
|
|
Input of abbreviated word form |
Text searching system5369577
Abstract
An apparatus for searching a collection of words based upon an input word, the apparatus including means for generating a first set of words containing members that are lexically related to the input word, the first set of words including words that are other than regular inflectional nouns; and also including a search engine for searching the collection of words to detect the occurrence of any of the words from a group of search words, the group of search words including the input word and the first set of words.
Claims
What is claimed is:
1. A computer apparatus for searching a collection of words based upon an input word, said computer apparatus comprising:
a generating means for generating a first set of words containing members that are lexically related to the input word, said first set of words including words that are both regular inflectional nouns and words that are other than regular inflectional nouns, the generating means including
a recognition engine for finding at least one linguistic stem within said input word and identifying at least one suffix attached to each said linguistic stem,
generating members of said first set of words by stripping from said input word suffixes selected from a first group of suffixes which do not substantially change the meaning of words when stripped therefrom, and
performing inflectional and derivational analysis upon said input word by generating all inflections and derivations that can be generated from said linguistic stems by the addition of one suffix wherein said derivational analysis employs at least three derivational suffixes; and
a search engine for searching said collection of words to detect the occurrence of words from a group of search words, said group of search words comprising the input word and said first set of words.
2. The computer apparatus of claim 1 wherein said first set of words further includes words that are both regular inflectional verbs and words that are other than and regular inflectional verbs.
3. The computer apparatus of claim 1 wherein said generating means is a morphological analyzer.
4. The computer apparatus of claim 3 wherein said recognition engine generates a derivational history including information about any parses that were successfully performed by said recognition engine on said input word.
5. The computer apparatus of claim 4 wherein said derivational history contains, for each successful parse of the input word, the part of speech of the input word for that particular parse.
6. The computer apparatus of claim 3 wherein said morphological analyzer comprises a recognition engine for finding the linguistic stem within said input word by proceeding through the input word from the beginning of the input word and selecting successively longer sequences of characters from the input word, starting with at least the first character of the input word, and comparing the successive sequences of characters to stems stored in a stem lexicon and then identifying suffixes attached to each said linguistic stem by selecting sequences of characters of the input word following the linguistic stem and comparing each selected sequence of characters following the linguistic stem to suffixes stored in a suffix lexicon.
7. The computer apparatus of claim 3 wherein said morphological analyzer further comprises a means for generating a second set of words that are lexically related to selected members of said group of synonyms, said selected members of said group of synonyms being members of said group of synonyms which have linguistic stems and suffixes which allow the affixing of further suffixes.
8. The computer apparatus of claim 7 wherein said means for generating a second set of words generates said second set of words based upon information supplied by said recognition engine for the input word, the information supplied by said recognition engine include, for each parsing of said input word, a linguistic stem, an identification of each suffix attached to said linguistic stem, and a part of speech of the input word for the parse of the input word and the recognition engine using the linguistic stem, suffixes and part of speech for each parse to select said selected members of said group of synonyms.
9. The computer apparatus of claim 7 wherein said group of search words further comprises said second set of words.
10. The computer apparatus of claim 1 wherein said collection of words is an index wherein the index is a group of words not comprising a text.
11. The computer apparatus of claim 1 wherein said collection of words is a text.
12. The computer apparatus of claim 1 wherein said first group of suffixes includes members selected from a first subset containing all inflectional suffixes, #ful, #ish, +ious, +ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, #ize, +ify, and #y.
13. The computer apparatus of claim 1 further comprising a thesaurus database and wherein said generating means uses said thesaurus database to generate a group of synonyms, the members of said group of synonyms being synonyms of members of said first set of lexically related words.
14. The computer apparatus of claim 13 wherein said group of search words further comprises the group of synonyms.
15. The computer apparatus of claim 13, wherein said generating means generates a second group of words from members of said group of synonyms which have linguistic stems and suffixes which allow the affixing of further suffixes by adding to said members of said group said further suffixes which are selected from a second group of suffixes wherein the members of said second group of suffixes include only those suffixes which do not substantially change the meaning of words when added thereto.
16. The computer apparatus of claim 15 wherein said group of search words further comprises the second group of words.
17. The computer apparatus of claim 15 wherein said second group of suffixes includes members selected from a subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, and #y.
18. A computer apparatus comprising a subject expansion system for expanding an input word into a plurality of related words, the computer apparatus comprising:
a means for receiving the input word; and
a generating means for generating a first group of words from the input word,
wherein said generating means generates at least some of said first group of words by stripping from said input word only suffixes selected from a first group of suffixes, the members of said first group of suffixes including only those suffixes which do not substantially change the meaning of words when stripped therefrom, and includes
a recognition engine for finding all linguistic stems within said input word and performing inflectional and derivational analysis upon said input word by generating all inflections and derivations that can be generated from said linguistic stems by the addition of one suffix wherein said derivational analysis employs at least three derivational suffixes.
19. The computer apparatus of claim 18 wherein said first group of suffixes includes inflectional and derivational suffixes.
20. The computer apparatus of claim 19 wherein said first group of suffixes includes members selected from a first subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, #ize, +ify, +fy, and #y.
21. The computer apparatus of claim 18 wherein said generating means generates at least one of said first group of words by adding to words derived from said input word by stripping suffixes from said input word only those suffixes which are selected from a second group of suffices, the members of said second group of suffixes including only those suffixes which do not substantially change the meaning of words when added thereto.
22. The computer apparatus of claim 21 wherein said second group of suffixes includes inflectional and derivational suffixes.
23. The computer apparatus of claim 22 wherein said first group of suffixes includes members selected from a first subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, #ize, +ify, +fy, and #y.
24. The computer apparatus of claim 18 further comprising a thesaurus database and wherein said generating means indexes said thesaurus database with members of said first group of words to read from said thesaurus database synonyms corresponding to members of said first group of words having synonyms in said thesaurus database to generate a group of synonyms comprised of said synonyms read from said thesaurus database, the members of said group of synonyms being synonyms of certain members of said first group of words which have said synonyms in said thesaurus database.
25. The computer apparatus of claim 24 wherein said generating means expands said group of words by adding to selected members of said group of synonyms only those suffixes which are selected from a second group of suffixes, wherein said selected members of said group of synonyms are those members of said group of synonyms which have linguistic stems and suffixes allowing the affixing of further suffixes and said second group of suffixes including only those suffixes which do not substantially change the meaning of word when added thereto.
26. The computer apparatus of claim 25 wherein said second group of suffixes includes inflectional and derivational suffixes.
27. The computer apparatus of claim 26 wherein said first group of suffixes includes members selected from a first subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, #ize, +ify, +fy, and #y.
28. The computer apparatus of claim 18 wherein said generating means comprises a morphological analyzer.
29. The computer apparatus of claim 28 wherein said morphological analyzer comprises said recognition engine for generating a base history for said input word, said base history identifying one or more base words for said input word, said one or more base words representing the input word with one or more suffixes removed.
30. The computer apparatus of claim 29 wherein said recognition engine comprises:
a means for finding a linguistic stem within said input word; and
a means for identifying suffixes attached to said linguistic stem, wherein said stem finding means and said suffix identifying means cooperate to conduct morphological analysis of the input word from the root to the affix.
31. A computer apparatus for generating a group of words from an input word, said computer comprising:
means for removing suffixes from said input word to generate at least one base word;
a thesaurus means adapted to receive said one or more base words and generate a group of synonyms for certain of said at least one base word; and
a generating means comprising a recognition engine for generating a base history for said input word, said base history identifying said one or more base words, said generating means including
a means for finding a linguistic stem within said input word,
a means for identifying suffixes attached to said linguistic stem, wherein said linguistic stem finding means and suffix identifying means cooperate to conduct morphological analysis of the input word from a root to an affix, and
a means for performing inflectional and derivational analysis by generating all inflections and derivations than can be generated from said linguistic stems by the addition of one suffix wherein said derivational analysis employs at least three derivational suffixes.
32. The computer apparatus of claim 31 wherein said generating means further comprises a generation engine for generating a second set of words containing members that are lexically related to selected members of said group of synonyms, said selected members of said group of synonyms comprising the synonyms of said group of synonyms which have lingusitic stems and suffixes which allow the affixing of further suffixes.
33. The computer apparatus of claim 32, further comprising a selection means for selecting members of said group of synonyms to identify said selected members of said second group of words by identifying said synonyms of said group of synonyms which have linguistic stems and suffixes which allow the affixing of further suffixes.
34. A computer-implemented method for generating a collection of words from an input word, said method comprising:
generating, by means of a generating engine, a first set of words containing members that are lexically related to the input word, said first set of words including words that are both regular inflectional nouns and words that are other than regular inflectional nouns, by operation of a recognition engine
stripping from said input word only suffixes selected from a first group of suffixes, the members of said first group of suffixes including only those suffixes which do not substantially change the meaning of words when stripped therefrom; and
generating, by operation of the generating engine, a second set of words that are lexically related to the input word, said second set of words including words that are both regular inflectional nouns and words that are other than regular inflectional nouns by
finding at least one linguistic stem within said input word,
identifying suffixes attached to each said linguistic stem, and
by operation of the recognition engine, performing inflectional and derivational analysis by generating all inflections and derivations than can be generated from said linguistic stems by the addition of one suffix wherein said derivational analysis employs at least three derivational suffixes.
35. The method of claim 34 for generating a collection of words from an input word further comprising the steps of:
stripping said suffixes from said input word to generate one or more base words; and
generating a group of synonyms for certain of said one or more base words by indexing a thesaurus database with said certain of said one or more base words and reading corresponding synonyms from said thesaurus database to comprise said group of synonyms.
Description
BACKGROUND OF THE INVENTION
The invention relates generally to morphological analyzers and to text management systems.
Each year organizations spend countless hours searching through documents and images, organizing filing systems and databases. Even with large information retrieval systems, considerable resources are needed to index documents, guess which key words will locate needed information, search through pages one query at a time, and sort through all the irrelevant data that the search actually yields.
A number of studies evaluating large information retrieval systems show that these systems are retrieving less than 20 percent of the documents relevant to a particular search, and at that the same time only 30 percent of the retrieved information is actually relevant to the intended meaning of the search request. One of the key reasons for poor retrieval results is that the people who perform retrieval only know the general topics of their interest and do not know the exact words used in the texts or in the keyword descriptors used to index the documents.
Another study analyzed how long it would take to index 5000 reports. It was assumed that each user was allowed 10 minutes to review each report, make indexing decisions by selecting the keywords, and record the information. At this rate, it would take 833 hours or 21 weeks for one full-time person (at 40 hours per week) to process the documents. The users would also need extra time to verify and correct the data. Under such an approach, the user must index incoming documents on a daily basis to keep the system from falling hopelessly behind. In addition, since the user chooses the relevant search terms, all unspecified terms are eliminated for search purposes. This creates a significant risk that documents containing pertinent information may not show up during a search because of the user's subjective judgments in selecting keywords.
Many text retrieval systems utilize index files which contain words in the documents with the location within the documents for each word. The indexes provide significant advantages in the speed of retrieval. One major disadvantage of this approach is that for most of the systems the overhead of the index is 50 to 100 percent of the document database. This means that a 100 Mbyte document database will require an index ranging from 50 to 100 Mbytes. This adds mass storage costs and overhead to the system.
SUMMARY OF THE INVENTION
In general, in one aspect, the invention is an apparatus for searching a collection of words based upon an input word. The apparatus includes means for generating a first set of words containing members that are lexically related to the input word, the first set of words including words that are other than regular inflectional nouns; and it includes a search engine for searching the collection of words to detect the occurrence of any of the words from a group of search words, the group of search words including the input word and the first set of words.
Preferred embodiments include the following features. The set of words includes words that are other than regular inflectional nouns and regular inflectional verbs. The generation means is a morphological analyzer. The collection of words is an index or text. The generating means generates certain members of the first set of words by stripping from the input word only suffixes selected from a first group of suffixes. The members of the first group of suffixes include only those suffixes which do not substantially change the meaning of words when stripped therefrom. The first group of suffixes includes members selected from a first subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, #ize, +ify, +fy, and #y. The morphological analyzer also includes a recognition engine for recognizing the input word and returning derivational information about the recognized input word. The recognition engine operates by parsing the input word to identify any stems within the input word and, for each identified stem, to identify suffixes which are added to produce the input word. The recognition engine returns a derivational history including information about any parses that were successfully performed by the recognition engine on the input word. The derivational history contains, for each successful parse of the input word, the part of speech of the input word for that particular parse.
Also in preferred embodiments, the recognition engine operates by finding a stem within the input word and then identifying suffixes attached to that stem. The recognition engine proceeds through the input word from the root to the affix and performs inflectional and derivational analysis, wherein the derivational analysis uses more than two derivational suffixes. The apparatus also includes a thesaurus database which the generating means uses to generate a group of synonyms, the members of the group of synonyms being synonyms of certain members of the first set of lexically related words. The group of search words includes the group of synonyms. The generating means generates a second group of words from selected members of the group of synonyms by adding only those suffixes to selected members of said group of synonyms which are selected from a second group of suffixes. The group of search words also includes the second group of words. The second group of suffixes includes only those suffixes which do not substantially change the meaning of words when added thereto. More specifically, the second group of suffixes includes members selected from a second subset containing all inflectional suffixes, #ful, #ish, +ous, + ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, and #y.
In addition, in preferred embodiments the morphological analyzer includes a generation engine for generating a second set of words that are lexically related to selected members of the group of synonyms. The generation engine generates the second set of words based upon information supplied by the recognition engine for the input word. The group of search words further includes the second set of words thus generated. There is also a selection means for selecting members of the group of synonyms to identify the selected members.
In general, in another aspect, the invention is a subject expansion system for expanding an input word into a plurality of related words. The system includes means for receiving the input word; and means for generating a first group of words from the input word, wherein the generating means generates at least some of the first group of words by subtracting only suffixes selected from a first group of suffixes, the members of the first group of suffixes including only those suffixes which do not substantially change the meaning of words when stripped therefrom.
Preferred embodiments include the following features. The first group of suffixes includes inflectional and derivational suffixes. More specifically, the first group of suffixes includes members selected from a first subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, #ize, +ify, +fy, and #y. The generating means generates at least some of the first group of words by adding to words derived from the input word only those suffixes which are selected from a second group of suffixes. The second group of suffixes including only those suffixes which do not substantially change the meaning of words when added thereto. The second group of suffixes includes inflectional and derivational suffixes. More specifically, the second group of suffixes includes members selected from a second subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, and #y.
Also in preferred embodiments, the subject expansion system includes a thesaurus database which the generating means uses to generate a group of synonyms which includes among its members synonyms of certain members of the first group of words. The generating means expands the group of synonyms into a larger group of words by adding to selected members of the group of synonyms only those suffixes which are selected from a second group of suffixes. The second group of suffixes includes only those suffixes which do not substantially change the meaning of words when added thereto. The second group of suffixes includes inflectional and derivational suffixes. More specifically, it includes members selected from a second subset containing all inflectional suffixes, #ful, #ish, +ous, +ic, +al, #ar, #er, #or, +ive, +ory, #able, +able, +ible, #ment, #ness, +ity, +ety, +ty, #ly, and #y.
In general, in yet another aspect, the invention is a subject expansion system for generating a group of words from an input word. The system includes means for removing suffixes from the input word to generate one or more base words; and a thesaurus means adapted to receive the one or more base words and generate a group of synonyms for certain of them.
Preferred embodiments include the following features. The generating means includes a recognition engine for generating a base history for the input word, wherein the base history identifies those one or more base words. The recognition engine includes means for finding a stem within the input word; and means for identifying suffixes attached to that stem. The stem finding means and suffix identifying means cooperate to conduct morphological analysis of the input word from the root to the affix. The recognition engine performs inflectional and derivational analysis, wherein said derivational analysis employs more than two derivational suffixes. The generating means also includes a generation engine for generating a second set of words containing members that are lexically related to selected members of the group of synonyms. The apparatus also includes a selection means for selecting members of the group of synonyms to identify the selected members.
One advantage of the invention is that it is capable of performing full inflectional and derivational analysis of an input word. Also, the invention is capable of recognizing multiple suffixes that are attached to a stem word to produce the input word.
Another advantage of the invention is that it substantially improves the completeness of results in document and/or text searches. Furthermore, the invention automatically expands keywords in a search to include lexically related words, synonyms of certain of said lexically related words, and words that are lexically related to the synonyms. In addition, the invention gives a user the ability to search text and other documents for concepts without knowing the exact form in which the concepts are discussed. Furthermore, the invention is capable of analyzing and expanding a keyword to include synonyms even when the keyword is not in a thesaurus by generating lexically related words by stripping suffixes from a keyboard to create new base words and by adding suffixes to the new base words.
Another advantage of the invention is that it can be used to substantially reduce this overhead in document indexing by identifying the words that are important to the content of the document and generating an index containing this subset of words. This not only saves disk space but also processing time, since the search operations can be performed on a much smaller index. The invention generates a list of topic words from text, each of which is informative about the content of the text, and it excludes words from text that do not convey much information. Using linguistic information to perform this task provides several principled ways of distinguishing words with high information content from words with low information content.
Another advantage is that the invention helps eliminate additional redundancy in an index of concepts found within text by identifying words that are related by inflection or derivation (e.g., "category" and "categories," "subject," "subjects," and "subjectness"). Instead of representing each instance of a word that differs slightly from previous instances, a base of the word is stored and used to represent all forms that can be derived from it. Thus, for example, "category" may be stored once in the list and it can represent "category," "categories," and even "categorical." This greatly reduces the storage required for indexes to text.
The filtering procedures described in this application were applied to a number of text samples for evaluation, including a 153-word paragraph from a linguistics article, a 447-word financial memo, and a page from a functional specification containing 550 words. The resulting lists for these three contained 23%, 25%, and 17% of the words in the original text, respectively. The results obtained by testing other text samples were consistent with these numbers. The filtering mechanisms of the invention can identify 75% to 80% of the text as irrelevant to the topic of the text.
Other advantages and features will become apparent from the following description of the preferred embodiment and from the claims.
DESCRIPTION OF THE PREFERRED EMBODIMENT
FIG. 1 is a block diagram showing a morphological analyzer and a lexicon.
FIG. 2 is a base history for `fish`;
FIG. 3 is a block diagram of an indexing system which includes an intelligent filter;
FIGS. 4a-e present a pseudo-code description of the operation of the intelligent filter;
FIG. 5 is a topic data structure used by the intelligent filter to pass topic information back to the application;
FIG. 6 illustrates the use of the intelligent filter to analyze text, one block at a time;
FIG. 7 is a block diagram of a text searching system including a subject expansion module;
FIG. 8 is a block diagram of the subject expansion module shown in FIG. 7;
FIGS. 9a-b present a pseudo-code description of the operation of SESrecword;
FIG. 10 is an information object which is a member of the information list data structure created by SESrecword;
FIG. 11 is a list of suffixes which can be added to a word and a list of suffixes which can be stripped from a word without significantly changing the meaning of the word;
FIG. 12 is a block diagram of another embodiment of the subject expansion module shown in FIG. 7;
FIG. 13 is a pseudo-code description of the operation of SESexplist;
FIG. 14 is a pseudo-code description of the operation of SESgen;
FIG. 15 is a block diagram showing the morphological analyzer and the lexicon;
FIG. 16 illustrates the structure of the stem lexicon;
FIG. 17 illustrates the form of the data structure associated with each node in the stem lexicon;
FIG. 18 is a continuation class tree for `fish`;
FIG. 19a is the structure of the property byte portion of a continuation class lexicon entry;
FIG. 19b is the structure of the continuation class node portion of a continuation class lexicon entry;
FIG. 19c is the structure of the <type/suff-flag-maps> portion of a continuation class lexicon entry;
FIG. 19d is the structure of the <alt/offset> portion of a continuation class lexicon entry;
FIG. 20a illustrates a portion of a suffix lexicon;
FIG. 20b relates the continuation class numbers used in FIG. 20a to corresponding continuation classes and the associated suffixes;
FIGS. 21a-c show parse trees for `satisfy`, `satisfaction` and `revolution` respectively;
FIGS. 22a-h present a pseudo-code description of the operation of WFSrecognize;
FIG. 23 is a stem.sub.-- record data structure;
FIG. 24 is a CC.sub.-- node data structure;
FIG. 25 is a stem.sub.-- node data structure;
FIGS. 26a-d present a pseudo-code description of the operation of Lookup.sub.-- stems;
FIGS. 27a-e present a pseudo-code description of the operation of Lookup.sub.-- suffixes;
FIG. 28 is a base history record data structure;
FIG. 29 is a base history for `fishiness`;
FIGS. 30a-c present a pseudo-code description of the operation of Make.sub.-- base.sub.-- history;
FIGS. 31a-d present a pseudo-code description of the operation of Base.sub.-- hist.sub.-- loop;
FIGS. 32a-o present a pseudo-code description of the operation of WFSgenerate;
FIGS. 33a-c present a pseudo-code description of the operation of Get.sub.-- records;
FIG. 34 presents a pseudo-code description of the operation of Quickgen; and
FIGS. 35a-b present a pseudo-code description of the operation of Genloop;
Structure and Operation
Referring to FIG. 1, two principal components of the embodiments described herein are a morphological analyzer 10 and a lexicon 12. To provide a basis for describing the embodiments which employ these components, only a brief description of these two components will be presented at this time. For the operation of the embodiments which employ these components as described herein, the particular details of construction and operation of morphological analyzer 10 and lexicon 12 are not important. Nevertheless, after describing the embodiments which employ these components, a more detailed description of the design and operation of the particular morphological analyzer and lexicon used in the embodiments is presented later.
The Lexicon
Lexicon 12, contains information about the underlying (lexical) forms of all the words that can be generated or recognized (in their surface form) by morphological analyzer 10. Lexicon 12 is not simply a listing of all these words in their underlying form, but instead contains morphological items, referred to as morphemes, that are combined to build the lexical forms of words. For example, the morphemes `success`+`ful`+`ly` form the word `successfully`. Associated with each morpheme is information that can include the part of speech the morpheme forms, related forms, and a list of what kinds of morphemes can follow the current one.
The Morphological Analyzer
The main functions of morphological analyzer 10 are recognition and generation. An internal function, referred to as WFSrecognize, implements the recognition capabilities of morphological analyzer 10. Another internal function, referred to as WFSgenerate, implements the generation capabilities of morphological analyzer 10. WFSrecognize analyzes a word to determine its morphological structure, and WFSgenerate generates the correct spelling of a word given its underlying morphological structure. Each of these internal functions will be described in greater detail later.
When performing recognition, morphological analyzer 10 analyzes an input string, identifying its root (or roots), any intervening words in its derivational history, any suffixes it may contain, and the lexical categories of all words in the derivational history. If the input string is successfully parsed (i.e., if a stem and the associated suffixes required to form the input word are found), morphological analyzer 10 returns a base history. If the input word has more than one parse, a base history for each is returned, each history (or path) presenting a listing of each form of the input word as each successive suffix is stripped. Thus, for example, in the case of `fishes`, two histories, shown in FIG. 2, are returned.
Each base history includes one or more records showing the bases that result from successively stripping off suffixes. Note that the word left after a suffix has been removed is called the base of the word with the suffix added. If no more suffixes can be stripped off a word, the word is the root of the other words. Each record contains the word from which a suffix is stripped, the part of speech of that word, an index referring to the record in the history that contains the base that is formed after the suffix is stripped, the suffix which is added to the base to make the word and the rating of the suffix which was added. The suffix rating is a number that is used to classify a suffix and the word it forms. Suffixes are grouped according to the closeness in meaning before and after the suffix is added. For example, inflectional endings, which simply change grammatical features of the original word, have a rating of 1. Derivational endings, which usually change the part of speech of the original word and generate a word more distant in meaning, have a rating of 2.
When performing generation, morphological analyzer 10 synthesizes words that are lexically related to a given word, adding suffixes, if appropriate, using any restrictions specified with the input regarding suffixes or lexical category to control the size and content of the set of words returned. Generation involves the following sub-tasks. Morphological analyzer 10 first does recognition on the input string to find out: how many parses the input string has; its stem or stems; and if it already contains suffixes. Morphological analyzer 10 then identifies the stem which is to be used for synthesis. Next, morphological analyzer 10 determines what suffixes can be added to it and returns all surface strings that can be synthesized from the lexical input word.
Text Management Functions
In accordance with one aspect of the invention, the morphological analyzer is used to aid text management functions, such as indexing and searching. Embodiments for carrying out this text management function will now be described.
Referring to FIG. 3, in an indexing system 100, input words 102 from a block of text are passed to an intelligent filter 104 that automatically identifies which of the input words are topic or concept words, i.e., words which describe the meaning of the text in which the words appear. Intelligent filter 104 uses several mechanisms to identify the topic words, including a stop list 106, which lists words that are automatically rejected as topic words, a keep list 108, which lists words that are automatically accepted as topic words, and a morphological analyzer 110, which analyzes the input word to identify its morphological structure. Morphological analyzer 110 makes use of a lexicon 116 to perform the morphological analysis of the input words that are passed to it. Intelligent filter 104 uses the morphological information derived for the input words that are sent to morphological analyzer 110 to determine which of the input words are most likely to be topic words. The list of topic words generated by intelligent filter 104 then passes to a conventional indexing engine 114, which uses the topic words to generate an index for the text from which the input words were extracted. Indexing engine 114 may use any one of a number of known indexing techniques to either produce index cards for the text being indexed or to generate an index identifying the location of the topic words within the body of the text.
Stop list 106 contains a list of words that typically do not convey topic information, such as articles, prepositions, conjunctions, and other function words. A substantial number of the words in this list can be identified from a dictionary on the basis of their part of speech. The value of adding all of such words to a stop list is that intelligent filter 104 need not waste time and computational resources of morphological analyzer 110 to analyze those words. Stop list 106 also includes words that were identified by running samples of text through intelligent filter 104 and then analyzing the results to identify words that appear at the output but clearly do not convey topic information. Thus, another value of stop list 106 is that it serves to catch those few words that are not caught by the analytic component of the intelligent filter 104.
Keep list 108 serves the opposite function to that of stop list 106. It specifies words that the analytic component of intelligent filter 104 might tend to reject even though they convey topic information. Keep list 108 may be empirically generated by running several sample documents through intelligent filter 104 and then identifying those words which were rejected but which the user feels should be included among a list of topic words. In this way, keep list 108 provides a mechanism for fine tuning the system and for customizing the system to the particular needs and the unique vocabulary of particular users.
The operation of intelligent filter 104 will now be described with the aid of the flow diagram presented in FIGS. 4a-d. When a text indexing application program needs to index a block of text, it first calls an initialization function to open up the functions and lexicons (i.e., the reference works) that will be used by intelligent filter 104 to analyze the words passed to it. The application which calls intelligent filter 104 also allocates space for a topic structure 124 (see FIG. 5), which is used by intelligent filter 104 to pass topic information back to the application. Topic structure 124 includes a field 115 for the input word which was analyzed, a field 119 for an array of pointers to the bases that were found in the input word, and a field 117 for the number of bases within the array.
In its call to intelligent filter 104, the application passes an input word 102, i.e., the current word. Referring to FIGS. 4a-d, intelligent filter 104 first determines whether the current word qualifies for morphological analysis by checking whether it contains alphabetic characters (step 130). If it does not have alphabetic characters, filter 104 rejects the current word as a topic word and requests the next word from the application program (step 132). If the current word does have alphabetic characters, filter 104 checks whether there are quotation marks around it (step 134). If the current word is a quoted string, filter 104 identifies it as a topic word and passes this information to indexing engine 114 using topic data structure 124 so that indexing engine 114 can index the current word (step 136). Then filter 104 requests another word. On the other hand, if the current word is not a quoted string, filter 104 looks up the word in keep list 108 (step 138).
If the current word appears in keep list 108, filter 104 indexes the word and moves on to the next word (step 140). If the current word does not appear in keep list 108, filter 104 compares it against stop list 106 (step 142). If the current word appears in stop list 106, filter 104 rejects the word and requests the next word (step 144). If the current word does not appear in stop list 106, filter 104 calls upon the recognition capability of morphological analyzer 110, namely, WFSrecognize, to obtain a morphological analysis of it (step 146).
WFSrecognize identifies the stems within the current word and determines which suffixes have been added to those stems to create the current word. The analysis begins with the first character in the current word and proceeds one character at a time until each stem and any suffixes which have been attached to that stem are found. That is, WFSrecognize parses the input word. When the analysis is complete, WFSrecognize returns a base history for the current word. The base history consists of a list of history arrays, each of which contains the records of a corresponding one of the successful parses which were found.
It is possible that WFSrecognize will not find the current word or any base words for the current word within lexicon 116. In that case, WFSrecognize indicates that the word was not recognized. This occurs for words such as some proper names (e.g., Nixon). When WFSrecognize reports that the current word was not recognized, filter 104 treats the current word as a topic word, indexes it and then requests the next word (step 148).
It is also possible that the current word contains a character that cannot be processed by WFSrecognize. In such cases, filter 104 indexes the word and then moves onto the next word (step 150).
If WFSrecognize succeeds in parsing the current word, intelligent filter 104 uses the returned base history information to evaluate whether the current word is a topic word. First, filter 104 checks whether the current word has only one part of speech associated with it (i.e., was there only one successful parse of the word?) (step 152). If only one part of speech was returned, filter 104 checks what the part of speech is. If the part of speech is a noun, filter 104 indexes the current word and moves on to the next word (step 154). For example, the current word might be `history` or `science` both of which are nouns, in which case, filter 104 indexes the word.
On the other hand, if the part of speech is an adjective, such as `historic` or `scientific` filter 104 obtains the base of the current word and checks its part of speech (step 156). (Recall that the base is identified in the base field of the history record for the current word.) If the part of speech of the base of the current word is a noun, filter 104 indexes the base and then moves onto the next word (steps 158-160). If the part of speech of the base is an adjective, filter 104 obtains the base of the base and checks its part of speech (steps 162-164). For adjective bases, filter 104 indexes the base only if the base of the base is a noun (step 168). Otherwise, it rejects the word and moves on to the next word (step 170). For example, if the current word is `historical` i.e., an adjective having a noun base of `history` filter 104 indexes `history`. If the base is neither a noun nor an adjective, filter 104 also rejects the current word and moves on to the next word.
If the current word is an adverb, filter 104 performs a similar analysis to that which it performed for an adjective (steps 174-202). That is, filter 104 gets the base of the current word and checks its part of speech. If the base is a noun, filter indexes the base and moves on to the next word. However, if the base is an adjective, filter 114 looks at the next level, i.e., the base of the base. If the base of the base is a noun, as in the case of `scientifically` filter 104 indexes the base of the base (i e., `science`) and moves on. If the base of the base is an adjective, however, filter moves to the next level of the history for the current word and looks at the base of the base of the base, if one exists. If the base of the base of the base is a noun, filter 104 indexes that base, otherwise, it rejects the word and moves on.
If the current word has only one part of speech and if it is not a noun, an adjective or an adverb, filter 104 rejects it. Similarly, if the current word is an adverb but its base is neither a noun nor an adjective, filter 104 rejects it.
For the cases in which the current word has two (and only two) parts of speech (i.e., it is ambiguous), filter 104 indexes the word only if one part of speech is a noun and the other part of speech is not an adverb (steps 206-212).
For the cases in which the current word has three parts of speech (and only three), filter 104 indexes the word only if one of its parts of speech is a noun (steps 214-220).
Finally, for those cases in which the current word has more than three parts of speech, filter does not index the word (steps 222-224).
In the above description, it should be noted that if the current word is ambiguous (e.g., `leaves` may be the plural of `leaf` or the third person singular of the verb `leave`), filter 104 will output more than one base for the word.
After filter 104 has processed all of the words in the text, the application frees up any allocated tables that were used and it closes the reference work.
After the application has used the information in the topic structure that was passed to the application, the application also frees the memory allocated for it.
In the above-described embodiment, text is processed one word at a time. That is, the application program passes each word to intelligent filter 104, one word at a time, and generates an index using whatever is returned. It is also possible for the application to pass a block of text to intelligent filter 104, in which case filter 104 generates an array of topic structures, one structure for each topic word identified in the block of text. In that case, the application program calls a text analysis program which operates as shown in the flow diagram of FIG. 6.
The application program first initializes the text analysis program, and the appropriate data structures and it opens the reference work containing the text that is to be indexed (step 225). Then, the application program fills a buffer with a first block of text to be analyzed and passes this to the text analysis program which, in turn, tokenizes the text to identify the individual words within it (steps 229 and 231). Known techniques may be used to tokenize the text, using, for example, spaces and/or quotation marks as indicators of word boundaries. The text analysis program calls intelligent filter 104 for each word found within the block of text (step 233). Intelligent filter 104 processes the words passed to it and builds a topic structure array including the individual topic structures produced for each of the topic words found among the words from the block of text. After the topic structure array for the entire block of text is complete, the application program retrieves the individual topic structures within the array and checks each topic word to see if it has already been identified for the text. If it has not, the topic word is added to a list of concepts associated with the text (steps 235 237).
The application program continues in this fashion through the text, one block at a time until the entire text has been indexed (step 239). After processing all of the text, the application program frees up the memory which was used during the text analysis operation and closes all relevant files.
Note that intelligent filter 104 tends to compress the size of the index required for any given text for at least two reasons. First, it identifies words that qualify as topic words and rejetcs other words which do not relate to content of the text. In addition, for many text words it supplies a word that is lexically related to the text word and the indexing engine indexes the lexically related word. As a consequence, a group of lexically related words within the text tends to be represented by a single word selected from the group of lexically related words.
The approach of using a single word to represent a group of lexically related words in a text index may be employed by itself. In that case, the user might manually select a word from text which is to be represented in the index and then use the recognition capabilities of the morphological analyzer to produce a group of words that are lexically related to the selected word. As the indexed word, the indexing engine would then use a word selected from the set of words formed by the original word and its lexically related words. In other words, each word in the resulting index represents a set of lexically related words within the text. In that case, each of the lexically related words in the text does not appear as a separate word in the index.
FIG. 7 illustrates a system 10 in which morphological analyzer 110 assists in searching text. In text search system 201, an input word 203, which is part of a search request generated by a user, is passed to a Subject Expansion Module (SES module) 205 where it is expanded into a list of related subjects. A search engine 213 uses the expanded list generated from the input word to search text stored in a memory 207 and to return as output the identity of those documents which refer to any of the items appearing on the expanded list. To perform its subject expansion functions, SES module utilizes morphological analyzer 110 to generate a list of words lexically related to the input word. Morphological analyzer 110 relies upon the information stored in a lexicon 110 to perform a morphological analysis of the input word and generate a corresponding base history. The history contains the different bases found within the input word and the suffixes that are attached to those bases to generate the input word. Using the information developed by morphological analyzer 110, SES module 205 uses a thesaurus lexicon 211 to expand certain of the lexically related words into a list of synonyms for those words. In general, the expansion is done for each of the definitions that are found for the input word, i.e., for each part of speech. However, since synonyms are not available for all words, in the cases when synonyms are not found for the input word, SES module 205 uses lexically related words that are found within the base history generated for the input word.
The operation of SES module 205 is generally illustrated by the block diagram shown in FIG. 8. When an input word is passed to SES module 205, it calls an SESinit function 215 to initialize the data structures, functions and lexicons which will be used during its operation. After initialization, a SESrecword function 217 uses morphological analyzer 110 to construct a base history for the input word and, using the information available in the base history, generates a list of synonyms from thesaurus lexicon 211. For each of the words and the synonyms generated by SESrecword, a second function SESexplist 219, takes the words and synonyms generated by SESrecword 217 and generates a set of lexically related words and synonyms and lexically related synonyms, which it passes along with the input word and its lexically related words to search engine 213 as an expanded list to be used for searching. After the expansion task is completed, a SESfini function 221 frees up the memory that was initialized for this operation and closes the lexicons. SESrecword 217 and SESexplist 219 will now be described in greater detail.
SESrecword 217 operates as shown in FIGS. 9a-b. When the function is first called, it uses WFSrecognize to analyze the input word that is passed as part of the function call (step 230). For each successful parse of the input word WFSrecognize returns a history. Each history contains, among other things, the stem, the suffixes which are attached to the stem to form the input word and the part of speech of the input word. Since it is possible that more than one definition of the word exists, WFSrecognize may find more than one successful parse.
After WFSrecognize completes its recognition function and returns the information, SESrecword 217 loops through each definition (i.e., path in the history) that is returned (step 232) and for each path, it performs the following operations. First, it checks whether the current path has the same part of speech as any other path which it has processed thus far (step 254). If the current path has the same part of speech as a previously processed path, SESrecword 217 skips the current path and moves on to the next one (step 236). Note that the expansion of a path that yields the same part of speech as a previously expanded path will also yield the same expansion.
If the part of speech for the current path is new, SESrecword 217 checks whether the current part of speech is inflected (step 236). If it is inflected, there is the possibility that the base for the current path has already been found for a previous path. For example, the past tense and the past participle of a verb have the same base and, therefore, yield the same expansion. To avoid duplicating previous work and in the case of inflected words, SESrecword 217 checks the base of the current path against other bases that have been found for other paths (step 240). Information about the other paths is stored in an information list that SESrecword 217 is building during this process to return its results to the application program. The information list, which is a collection of different forms that are recognized for the input word, is a linked list of information objects 223, as illustrated in FIG. 10. If the base is found within the information list, SESrecword 217 skips the current path. Otherwise, it checks whether the base form is in thesaurus lexicon 211. If the base form is in the thesaurus lexicon, SESrecword 217 builds a new information object 223 for the base and adds the new object to the information list. The new object contains all of the synonyms that were found in the thesaurus lexicon associated with that base.
Referring back to FIG. 10, each object 223 includes a field 223(a) containing the word for which synonyms were found, a pos.sub.-- list 223(b) identifying the parts of speech associated with the word, a thesaurus buffer (TH buffer) 223(c) which contains all of the synonyms for the word, and a pointer 223(i) to the next object in the list. Each information object also includes a num.sub.-- words field 223(d) for specifying the number of synonyms in TH buffer 223(c), a num.sub.-- meanings field 223(c) for the number of different meanings or connotations that were listed for the word, and a pos.sub.-- sum field 223(f) for predetermined statistics about the average number of generated forms that will typically result from expanding the word as a function of its part of speech. In other words, pos.sub.-- sum provides a basis for estimating how many words an expansion of the list of words in TH buffer 223(c) will produce. Experience indicates that the number of words that an expansion will produce depends on the part of speech of the word. For example, a noun typically results in three words when expanded, a verb results in seven words and an adjective results in four words. This information is used later by the expansion function to determine how many words of the TH buffer to expand.
Each information object 223 also includes a sample information (sample.sub.-- info) field 223(q) containing an array of pointers to each of the possible samples within TH buffer 223(c) and a selection info field 223(b). A sample is defined as a possible meaning or connotation of the word. In the case of the College Thesaurus, synonyms are organized into different meaning categories. The pointers in sample.sub.-- info field 223(q) identify the beginning of each of the categories within TH buffer 223(c). As will be explained in greater detail shortly, in some embodiments, the user is offered the option of selecting which samples to include within the expansion performed by SESexplist 219. The user's indications are recorded in selection.sub.-- info field 223(k) for later use.
Referring again to steps 238-248 in FIG. 9a, after constructing the information object for the base or after determining that the base form is not found in the thesaurus lexicon, SESrecword 217 also checks whether the input word is in the thesaurus as an inflection (step 246). If it is, SESrecword builds another information object for the input word (step 248).
For the cases in which the part of speech for the current path is not inflected, SESrecword 217 checks if the word is in the thesaurus lexicon (step 252). If the word is in the thesaurus lexicon, an information object is built for the word and added to the information list (step 256). On the other hand, if the word is not in the thesaurus, SESrecword 217 examines its history to determine whether a suffix can be stripped from the word to form a base having substantially the same meaning (steps 258-260). The words have substantially the same meaning means, for example, if the function of the words created by stripping the suffix is semantically the same as the function of the word from which the suffix is stripped. Only certain suffixes can be stripped without significantly changing the meaning of the word. A list of those suffixes appears in FIG. 11. (Note that `+` and `#` are boundary characters or markers following Aronoff, Mark (1976) Word Formation in Generative Grammar, Linguistic Inquiry Monograph 1, MIT Press, Cambridge, Mass.) Such a list is stored in a table which SESrecword accesses to determine whether it is permissible to strip the current suffix so as to create a word from which synonyms may be generated. If the suffix can be stripped, SESrecword 217 tries to find the resulting base in the thesaurus lexicon. If the base is found, the function builds a word information object for the base. If it cannot be found, however, the function builds a word information object for the input word without any thesaurus data.
After SESrecword has looped through each path and performed the operations described above, it returns the completed list of information objects to the application program.
As shown in FIG. 12, in another embodiment, SES module 205 is modified to permit the user to skip over information objects for expansion or to select the synonyms within an information object for which expansion will be performed (an operation referred to generally as sampling). In the modified version of SES module 205, the application passes the output of SESrecword 217 to a SESsample function 223 which selects a subset of words from each category (i.e., from each meaning) and the application displays these subsets of words to the user. In the present embodiment (i.e., with the College Thesaurus), SESsample 223 selects the first two words in each category for display. The user then selects which information objects are to be skipped and the categories within information objects for which expansion is desired. Another function, namely, SESadd.sub.-- sel 227, stores the user's selections in selection.sub.-- info field 223(b) of the appropriate information object 223 (see FIG. 10).
In both embodiments described above, the actual expansion of the words returned by SESrecword 217 is performed by two other functions, namely, SESexplist 219 and SESgen 221 and illustrated in the flow diagrams of FIGS. 13 and 14, respectively. Each of those functions will now be described.
SESexplist 219 takes the information list returned by SESrecword 217 and expands it to include lexically related words generated from the recognized forms of the input word and the synonyms found by SESrecword 217. Referring to FIG. 13, when SESexplist 219 is first called, it loops through the list of information objects (step 270) and computes the total number of words in the TH buffers from the numbers in num.sub.-- words fields 223(d) (see FIG. 10) (step 272). Then, using the information stored in pos.sub.-- sum field 223(f), SESexplist 219 estimates the total number of words likely to result from expanding all of the sampled categories obtained through the thesaurus (step 274). (If the sampling feature is not present, it is assumed that words from all categories will be expanded.) The total number is then scaled to reflect any limitations that might exist on available memory. In a DOS environment, for example, the memory space limitations can be quite severe; whereas, in other environments, such as virtual memory systems, it may not be necessary to limit the number of words that are generated. Using the scaled number, SESexplist 219 computes the number of words that can be safely chosen from each sampled category for expansion without exceeding the available memory. The distribution of words chosen from the categories is performed so as to fairly represent all of the sample categories in the TH buffers 223(c). That is, some number is chosen from all selected categories with the larger categories receiving larger representation than the smaller categories.
After SESexplist 219 determines the number of words that can be expanded from each category, it loops through the list of information objects again to perform the actual expansion operations (step 276). For the current information object, SESexplist 219 first checks whether it was selected to be skipped (step 278). If the current object is marked to be skipped, SESexplist 219 moves on to the next information object. If the current object is not selected to be skipped, SESexplist 219 checks whether it has been sampled by examining the contents of its selection.sub.-- info field 223(h). If the object has been sampled and selections were made, SESexplist calls SESgen 221 on those selections. In this phase of operation, SESgen 221 expands the permitted number of words within the sampled category based upon the computations performed in steps 272 and 274, above.
If the object has neither been skipped nor sampled, SESexplist 219 calls SESgen 221 on all of the data present in the object (step 282). During this step, SESgen only expands the number of words in each category that are permitted by the limitations computed in steps 272 and 274, above.
When SESexplist 219 is done expanding the word in a given information object, it frees up any space associated with that object (step 284). After SESexplist has looped through all information objects, it returns the results of the expansion to search engine 213, which constructs a search based upon the information in the expanded list. In other words, search engine 213 conducts its search using the input word, words that are lexically related to the input word, synonyms of the input word and of the words that are lexically related to the input word (if any such synonyms exist), and words that are lexically related to the synonyms. Search engine 213 uses any one of a number of known techniques to use the expanded list to conduct the search for all documents or locations within documents that refer to any words on the expanded list.
The repeated calls to SESgen 221 generate the result list that is returned. As shown in FIG. 14, when SESgen is called, it first identifies the input word and its part of speech from the information found within the appropriate fields of the information object (step 290). For the input word, SESgen calls a generation function, WFSgenerate, which causes morphological analyzer 110 to produce all inflections and derivations that can be generated from that word by the addition of one suffix. It also produces any inflections of the derivational forms. WFSgenerate returns an output.sub.-- history data structure., which is an array of histories for all of the expansions that were found. The first record in each history contains the input word, and the last record contains an inflected form, a derivational form, or an inflection of a derivational form. In this last case, a middle record will contain the derivational form, i.e., the input word with one derivational suffix attached.
From among the histories that were produced, SESgen 221 selects all inflectional forms and adds those to a result list (step 294). If derivational forms are also included within the output.sub.-- history data structure, SESgen 221 selects only those derivations that have a suffix which can be attached to the current part of speech of the input word without significantly changing the meaning of the word. Only certain derivational suffixes may be attached to a base without significantly changing its meaning. As in the case of stripping suffixes, words formed by adding suffixes will have substantially the same meaning as the base word, for example, if the function of the words created by adding the suffix is semantically the same as the function of the word to which the suffix is added. A list of those suffixes appears under the appropriately labelled column in FIG. 11. Such a list of suffixes is stored in a table which SESgen accesses to determine whether any derivations have a suffix that can be attached to the current part of speech. It adds only those entries to the result list.
After SESgen has processed the input word of the information object, it loops through the synonyms found within the TH buffer in the information object (step 296). For the current synonym, SESgen 221 compares its part of speech with that of the input word (step 298). If the parts of speech are the same, SESgen 221 performs the same expansion as was described above. That is, it generates all inflections and derivations of the synonym which have only one suffix added to the synonym and it adds the inflections and only those derivations that have a suffix which can be attached to the current part of speech of the synonym to the result list for the information object (steps 300 and 302).
After SESgen 221 has looped through all of the synonyms for which expansions are to be generated, it returns the result list to SESexplist 219 (step 300).
Note that a limited form of subject expansion can be employed which does not use a thesaurus. In that case, the search is conducted using the input word plus the group of lexically related words that are generated by the morphological analyzer.
Having described the text management systems which employ the morphological analyzer, further details regarding the described embodiments of the morphological analyzer and the lexicon will now be provided.
The Lexicons
In the described embodiment, the morphemes have been divided into two main categories which are handled differently: the root morphemes (stems) and the ending morphemes (suffixes). The ending morphemes are attached to the root morphemes and to each other. Referring to FIG. 15, lexicon 12 is divided into two sub-lexicons based on these two categories of morphemes: a stem lexicon 16 containing the stem forms, and a suffix lexicon 20 containing the suffixes from all the different continuation classes. A third lexicon, referred to as continuation class lexicon 18, contains a continuation class tree for each stem. Each continuation class tree identifies all of the suffixes which may be added to the associated stem and, thus, it represents a set of links into suffix lexicon 20 for that stem.
Stem and suffix lexicons 16 and 20 are stored in compacted letter trie form with any information associated with the stem or ending accessed from the last character of that stem or ending. As will become apparent in the following discussion, however, the information associated with a stem differs from that associated with an ending.
The Stem Lexicon
Stem lexicon 16 is made up of two types of entries: (1) valid English words (e.g., "permit"), and (2) abstract stems, which are not valid words but to which suffixes can be added to form complete words (e.g., "permiss," which serves as a stem for "permissive" and "permission"). Referring to FIG. 16, stem lexicon 16 includes compacted letter tries including nodes 30, each one representing a different character of a stem found within the trie. Associated with each node 30 is a data structure 32 (shown in greater detail in FIG. 17) which contains information about that node and the nodes which are connected to it on the next lower level of the tree. Each data structure 32 has a character and at least three flags, namely, an end-of-word (EOW) flag 34, an end-of-stem (EOS) flag 36 and a sisters flag 38, and it may include a pointer array 40. EOW flag 34 indicates whether that node represents the end of that branch. EOS flag 36 indicates whether the node represents an end of a valid stem. And, sisters flag 38 indicates whether another branch extends off of the parent node. If a stem ends at a node, its pointer array 40 contains an index which identifies that stem's entry into continuation class lexicon 18 where all of the continuation class information for the stem is stored.
The Continuation Class Lexicon
A continuation class is associated with a single suffix and/or a single lexical category, which is the category of the word derived by adding that suffix. Stems do not themselves have lexical category information associated with them. Stems that are legitimate words (i.e., not abstract) have a continuation class that represents a null suffix but assigns a part of speech to the stem. A stem that is lexically ambiguous has such a continuation class for each lexical category to which it belongs.
When words of more than one lexical category can be derived by adding a single suffix, a separate continuation class is assigned for each lexical category. For example, `#s` is an inflectional suffix that forms plural nouns and also third person singular verbs; `#er` is an inflectional suffix that forms comparative adjectives and a derivational suffix that forms nouns from verbs. (Note that inflectional endings simply add inflection to the part of speech of the original word; whereas, derivational endings change the part of speech of the original word and form words that are more distant in meaning.) For each of these suffixes, there are two distinct continuation classes. By requiring each continuation class to be associated with no more than one lexical category, the system can take advantage of constraints on the possible sequence of suffixes, since most suffixes attach to words of a particular lexical category.
The system makes use of over 140 continuation classes, two thirds of which are associated with a suffix and a lexical category. There are 13 continuation classes associated with other affixes such as `+fac` in `satisfaction`. Since the addition of these formatives does not produce a complete word, they have no lexical category associated with them. Another 36 continuation class are associated with a lexical category but no affix and are used to assign a part of speech and grammatical features to roots (e.g., NOUN assigns noun singular; VPAST assigns verb past tense) and to irregular inflectional forms that do not contain a suffix.
Appendix A1 identifies all of the top level classes (level 1 classes) represented in the continuation class lexicon. Appendix A1 identifies the class (e.g. VERB), the numerical code used to represent that class (in the case of VERB, it is 3c); and the part of speech (POS) for that class, which in the case of VERB is `verb tenseless`.
Every stem has its own unique tree of continuation class symbols that represent the valid combinations of suffixes that can be attached to the stem. Besides this tree, there are fields for the exceptional cases, i.e., ones that cannot be included in the continuation class tree. These exceptional case fields contain a cross-reference to another stem in stem lexicon 16 which represents an irregular form.
Here `irregular forms` means words which cannot be related to their base through the continuation classes because of spelling changes too significant to be handled by the spelling change rules of the morphological analyzer. Such irregular forms include: irregular verbs (forgive-forgave-forgiven); nouns with irregular plurals (index-indices); and irregular comparative and superlative forms of adjectives (good-better-best). The irregular forms are given separate entries in the lexicon, where they refer back to their bases. The base entry is linked to the irregular form entries by the <IRR> label and the irregular form entries are linked to the base entry by the <BAS> label.
Associated with the continuation class for the irregular stem form is a corresponding field cross-referencing back to the stem the irregular form is based on. Besides the continuation class tree and the exception fields, there is a flag that appears when the last syllable of a stem, even if it is just one syllable, is stressed or when the last character of a stem should be doubled. This is called a gemination mark which will cause the automaton to double the last character if other conditions allow it.
Stress is represented in the continuation class lexicon for all words that exhibit primary stress on the last syllable including monosyllabic words. Because stress information is used solely by the gemination rule, secondary stress is not represented in the lexicon at all, and no stress information is represented in abstract stems or in continuation class lexicon 18. Consequently, stress is not represented for derived or inflected forms. Words that do not bear final stress but undergo gemination are also marked in the lexicon.
Table I following gives examples of the entries in continuation class trees associated with the stems `fish` and `shoot`.
______________________________________
Stem: fish
Gemination Mark: true
Continuation Classes:
Class - Corresponding.sub.-- Word(s)
<1> NOUN - fish (n.)
<2> LIKE - fishlike
<2> LESS - fishless
<2> Y#ADJ - fishy
<3> ADJREG - fishier, fishiest
<3> LY#ADV - fishily
<3> NESS - fishness
<2> NREG - fish's, fishes, fishes'
<2> ERY - fishery
<3> NREG - fishery's, fisheries, fisheries'
<1> IN1 - fish (n. pl.)
<1> VERB - fish (v.)
<2> ABLE#ADJ - fishable
<2> ER - fisher
<3> NREG - fisher's, fishers, fishers'
<2> VREG - fishes, fished, (have) fished, fishing
Stem: shoot
Gemination Mark: true
Continuation Classes:
Class Corresponding.sub.-- Word(s)
<1> VERB - shoot (v.)
<2> VPR3 - shoots
<2> VING - shooting
<2> ER shooter
<3> NREG - shooter's, shooters, shooters'
Irregular form: shot
Part of speech of irregular form: (v.p.t.)
Irregular form: shot
Part of speech of irregular form: (v.p.p.)
______________________________________
FIG. 18 displays the same continuation class information for `fish` but in tree form. Note that there is no part of speech information associated with the stem itself. Part of speech is associated with suffixes only. Because of this, all symbols on the first level of a continuation class tree (identified as LEVEL 1 in FIG. 18) are special. They represent null suffixes which contain no characters, but they assign the different possible parts of speech to the stem form to which they attach.
An inflection or derived word that cannot be generated by merely concatenating a root and affixes and applying spelling-change rules is linked to its real root by a pair of cross-references, one leading from the root to the derived form, and one leading from the derived form to the root. This mechanism is used to link words represented with abstract stems to roots (e.g., the `clar` entry with `clarify,` `clarification,` etc. to `clear`), irregular inflections to their root (e.g., `made` to `make`), and to link roots with words whose derivation violates spelling-change rules.
The layout of a continuation class lexicon entry is as follows: entry => <property><cc node><BAS><IRR><REL>. As shown in FIG. 19a, the property byte 40 includes a five bits which identify the properties of the entry. There is a STR bit 42, a BAS bit 44, an IRR bit 46, a REL bit 48 and an ABBR bit 50. STR bit 42 indicates whether the stem for this entry geminates. BAS, IRR and REL bits 44, 46, and 48 indicate if any of each type of field appears after the listing of the continuation classes. ABBR bit 50 indicates if the current entry belongs to an abbreviation (1 if it does and 0 it does not). If ABBR bit 50 is set to 1, there are no continuation classes for the entry.
Referring the FIG. 19b, a continuation class node 52 (i.e., <cc node>) is a one or two byte entry. The first byte includes an EOP bit 54, an ALT bit 56, and 6 bits 58 which encode the class for that node. EOP and ALT bits 54 and 56 are used for navigation through the continuation class tree. EOP bit 54 indicates whether the current node is the end of the path. ALT bit 56 indicates whether the current node has a sibling. If the encoded continuation class in the first byte is all zeros, the next byte contains the actual class.
The irregular field information is stored in the following form: {BAS, IRR, REL}=> <alt/offset> <type/suff-flag/maps> [<suff>] <map/class> <pos> <entry>. As shown in FIG. 19d, the <alt/offset> portion is a 1 byte entry which gives the length of the irregular string (i.e., <entry>) found at the end of the irregular field information entry. It also includes an ALT bit 66 which indicates whether another irregular field follows the current entry. The <type/suff-flag-maps> portion is a 1 byte entry having the structure shown in FIG. 19c. It includes two type bits 60, a SUFF bit 62 and a five-bit portion 64 which specifies the number of map fields associated with this particular irregular field. Type bits 60 indicate the type of irregular field; it specifies the bit position of this type as found in the property byte. SUFF bit 62 indicates whether or not there is a suffix byte following this entry. If a suffix byte is present, it is represented by the <suff> portion which is also a 1 byte entry containing a hash number for the suffix associated with the irregular field.
The <map/class> portion is a 1 byte entry for each class associated with the irregular field. Each top level class associated with the irregular form is listed here.
Finally, the <pos> portion is a 1 byte entry which encodes the part of speech associated with this node.
The Suffix Lexicon
Suffix lexicon 18 contains 78 distinct inflectional and derivational suffixes and formatives. Appendix A2 is a complete list of affixes, including both true suffixes and formatives, and it gives examples of words in which the suffixes appear. Note that a formative is a suffix that cannot end a word but must be followed by another. All suffixes are associated with a boundary marker, either `+` or `#` following Aronoff, Mark (1976) Word Formation in Generative Grammar, Linguistic Inquiry Monograph 1, MIT Press, Cambridge, Mass. Assignment of one boundary or the other to a suffix is based on the phonological and orthographic processes (e.g., vowel shift, stress shift, gemination) observed to occur when these suffixes are added to words. For example, all inflectional suffixes are associated with `#`. The boundary distinction is exploited by the spelling-change rules. Information associated with suffixes includes part of speech and basic grammatical features (e.g., number for nouns and tense for verbs), a suffix rating which indicates whether the suffix is inflectional, derivational, or formative in the sense explained earlier.
The information associated with a suffix morpheme includes a part of speech and a suffix rating. The suffix rating is a number that is used to classify a suffix and the word it forms. Suffixes are grouped according to the closeness in meaning before and after the suffix is added. For example, inflectional endings, which simply change grammatical features of the original word, have a rating of 1. Derivational endings, which usually change the part of speech of the original word and generate a word more distant in meaning, have a rating of 2. A derivational ending is one that derives a new word from the original, usually by changing the part of speech. There are two other kinds of endings that are discussed in other sections. These are formative endings, which are given a rating of 16, and REL endings, which are given a rating of 4. Suffix lexicon 18 further differs from stem lexicon 16 in that it contains continuation class vectors.
As mentioned above, all the endings that can follow a stem form are stored together in suffix lexicon 18, and any information about a particular ending is accessed from the last character of that ending. Each node within suffix lexicon 18 specifies: the character represented by the node; a continuation class vector, which specifies the continuation classes found below that node; a vector identifying all continuation classes that terminate at that node; an entry for each continuation class that terminates at that node giving the part of speech and a rating for that suffix and continuation class. The information is stored as a trie of sequential bytes with navigations bits for controlling navigation through the trie.
A hash algorithm determines a unique number for each suffix that can be used as an index into an array containing that suffix. The hash algorithm has two parts. The first part of the algorithm classifies the suffix into a particular category by computing an index for the suffix. In essence, the suffix is classified according to its length, its boundary character, and the first character other than the boundary character. Each category is, in turn, associated with a unique constant which is used to determine an index for the suffixes within the category. The second part of the algorithm distinguishes a suffix from the others in its category. A suffix is distinguished by the summation of its characters.
The algorithm is as follows. First, set the initial value of the index to 0 if the boundary character is `#` or to 1 if the boundary character is `+`. Then, multiply the length of suffix minus 2 by 10 and add the result to the index. For example, if the suffix is `+ety` its length is 4 and the result of the multiplication operation is 20 and the index at this point within the algorithm is 21. Finally, another constant is added to the index based upon the first character of the suffix. The constant which is added is determined by the following chart:
______________________________________
`a` 0
`e` 2
`i` 4
`s` 6
other 8
______________________________________
For `+ety` which has `e` as its first character, the index
becomes 23. The algorithm uses this index to look up a constant in the
constant array shown in Appendix B.
Each entry in Appendix B is headed by two numbers separated by a colon. The number before the colon identifies the category and the number after the colon is the constant that is added to each suffix that falls within that category. Under each entry is a list of suffixes which fall within the category. In other words, the index that is computed for each of those suffixes according to the above-described first part of the algorithm is the same.
In the second part of the algorithm, the sum of the values in ASCII code (mod a) of each character in the suffix (excluding the boundary characters) is computed. That is, with `a`=0, `b`<1, etc., sum the characters of the suffix other than the boundary character. There are two exceptions to this summation: `p` is not added to avoid a collision of `#ship` with `#some` and `z` is given the value `29` instead of `25` to avoid a collision of `+ize` with `+ify`. Finally, add the summation of characters just computed to the constant retrieved from the array. This yields the hash result for the suffix.
Continuing with the `+ety` example, the characters `e`, `t` and `y` have values (mod a) of 4, 19, and 24, respectively. The sum of these values is 47. The constant associated with category 23 is 22. Thus, the resulting has value is 69 (i.e., 49+22).
Each continuation class has a unique number. The continuation class vector contains the numbers of the continuation classes which appear below the current node. That is, it indicates which continuation classes can be found by searching the children of the current node. Thus, when searching for the ending associated with a continuation class, the search algorithm can know which branches in the letter trie to search. Also, when trying to determine what continuation class corresponds to a given suffix, those classes which are not in a vector while traversing the suffix lexicon tree can be eliminated from further consideration. (See Barton Jr., G. Edward, (1985) "The Computational Complexity of Two-Level Morphology," A.I. Memo No. 856, M.I.T. Artificial Intelligence Laboratory, Cambridge, Mass.: 33-35 for more details.)
FIG. 20a illustrates the general structure of the suffix lexicon. The numbers listed in braces {} represent the continuation class vectors which show what classes are found down that branch. The subscripted numbers show what continuation classes end at that point. The continuation classes and their respective numbers for this particular example of a tree are shown in FIG. 20b.
A more complete specification of the suffix lexicon used in the described embodiment is presented in Appendix A3. The information is in the form of a suffix letter trie. The symbols appearing in Appendix A3 have the following interpretations:
Alt=indicates whether there is a sister node in the trie or whether another entry follows the current one at a node;
EOS=stands for end-of-suffix and it indicates whether a suffix ends at that node;
EOB=stands for end-of-branch and it indicates whether the current node has any children;
POS=Part of speech;
DER=Derivational rating;
FORM=Formative rating;
INF=Inflectional rating; and
REL=REL rating.
True and false are represented by 1 and 0, respectively. In addition, numbers, which are shown in hexidecimal notation, are used to identify the continuation classes. A translation of the numbers into corresponding literal symbols identifying the continuation classes is shown in Appendix A4. In the suffix letter trie, when there is an EOS flag, the entire suffix is also printed beside that node. The entire suffix does not actually appear in the suffix lexicon, however.
The interpretation of the information in Appendix A3 will now be described using the first node as an example. The first node represents the `+` boundary character. The state of the flags for this node are: Alt: 1, EOS: 0, EOB: 0. Since the Alt flag is true, the node has a sister node elsewhere in the lexicon. In this case, the sister node represents the other boundary character, namely, `#`. Since both the EOS and EOB flags are false, this node does not represent the end of any suffix or of a branch. The node for `+` also indicates that there are 57 continuation classes which extend from this node and they are identified by the 57 numbers listed for the node. Since no suffixes end at this node, the list of continuation classes which end at this node is empty.
The Morphological Analyzer
When performing generation, morphological analyzer 10 synthesizes words that are lexically related to a given word, adding suffixes, if appropriate, using any restrictions specified with the input regarding suffixes or lexical category to control the size and content of the set of words returned.
The design of these functions is based on a "two-level" model for morphological analysis developed by Kimmo Koskenniemi. (See Kartunnen, L. (1983) "KIMMO: A Two Level Morphological Analyzer") The following is a brief overview of the model.
The two-level model is based on a set of transformational (i.e., spelling) rules. These rules translate the surface string of a word, which is the appearance of the word in text, directly into its lexical string or the underlying morphological structure of the word. This is done character by character based on the context of a character. For example, the following rule, (n . m) <--> P, means that a lexical 'n, becomes a surface `m` before a surface `P`. Each such rule is translated into a finite-state transducer, and the set of these rule-automata is then used by both the recognition and generation algorithms.
The transducers operate on lexical/surface pairs of characters, using one as input and the other as output, depending on whether analyzer 10 is performing recognition or generation. In recognition, a transition means that the transducers read a surface character for input and the corresponding lexical character is output; whereas, in generation, the lexical character becomes the input and the surface character is the output. Because this model is two-level (i.e., it translates directly from the surface to the lexical or vice versa), there are no intermediate stages in the process. This allows a transducer to produce the effects of a rule in just a single pass over the input.
Both generation and recognition involve finding a valid correspondence between the input and a representation that can be constructed using lexicon 12 and the above-mentioned rule set. Thus, morphological analyzer 10 cannot generate or recognize new words, i.e., words that are not already in lexicon 12.
Recognition involves the following sub-tasks. Morphological analyzer 10 looks for all the possible stems in an input string. Each stem identifies all suffixes which may be attached to that stem. For each candidate stem, morphological analyzer 10 looks up the possible suffixes. If the input string is successfully parsed, morphological analyzer 10 returns a base history such as was illustrated in FIG. 2 previously.
Thus, for example, the base history of `successfully` is as follows: ((`successfully`, adverb, `successful`, adjective, `+ly`, 1), (`successful`, adjective, `success`, noun, `+ful`, 2), (`success`, noun, `success`, noun, +0, 0)) where `success` is the root form of the words `successful` and `successfully`.
The root form is generally the same as the stem form found in stem lexicon 16. This is not the case if that stem is not a valid word. For example, in words like `satisfy` and `liquefy` where there appears to be a common ending morpheme `+fy`, the respective stems are `satis` and `lique`. Neither of these stems, however, is a real word. Such a stem is said to be "abstract".
If a stem form is abstract, the root form will consist of either an irregular form that is in a cross-reference field or the abstract stem plus an ending which constitutes the best common morpheme to build from to form the rest of the family of words associated with that stem. `Satisfy` for example, which has the abstract stem `satis` is the root for `satisfied` `satisfactory` and `satisfaction`. The suffixes `+ed` `+ion` and `+ory` attach to verbs, and since satisfy, is a verb that cannot be further reduced, among other reasons, it was chosen as the root form.
As noted previously, each record in the base history also includes a suffix rating. Including the suffix rating in the base history aids in the filtering of words returned from the generation function. By including this in the information passed to the generation function, only those words with the suffixes of that rating will be generated. Or, if all the words related to an input word are generated, a user/application can use the rating to filter those words by itself as needed.
In morphological analyzer 10, an internal function of the suffix rating is to denote endings that form words that are "distantly related" to a root word but can be recognized as root forms in and of themselves. The suffixes that form such words are referred to as REL endings. The relationship between the word with the REL suffix to the smaller root form will only be seen when a user/application causes all related forms of a word to be generated. The "distantly related" words will then be included in that generated list. Another internal function of the rating is to identify a formative, i.e., suffix that cannot end a word but must be followed by another.
Parsing an input word consists of matching the letters in the input word against a sequence of letters in lexicon 12 in such a way that the conditions defined by the spelling rules are met. To do this, morphological analyzer 10 takes the following steps. It traverses the lexicon tree, checking all possible interpretations of each character in the input string (i.e., all the lexical characters that could be paired with the surface character) against letters in lexicon 12. For each interpretation, it moves the rule automaton (i.e., the transducers) to a new state and it keeps those interpretations that do not cause the automaton to block. Before the next character, it checks for all possible pairings of a surface null with a lexical character, i.e., it checks for possible elision. Morphological analyzer 10 retains those pairings for which the automaton does not block. If the end of the input string is reached at the same point that the end of a stem or suffix is reached in lexicon 12, and the automaton is in a final state, then the input word has been successfully parsed.
Through generation, morphological analyzer 10 can also provide all the words lexically related to a given word, and its output can be restricted if certain parameters are specified. If the input word is lexically ambiguous, the output can be restricted to just those words derived from one lexical category of the word. The output can also be restricted to words derived by inflection only or by derivation only, to words of a particular lexical category, or to words derived by adding a particular suffix to the input word.
Generation involves the following sub-tasks. Morphological analyzer 10 first does recognition on the input string to find out: how many parses the input string has; its stem or stems; if it already contains suffixes; and if it is a root associated with an abstract stem. Morphological analyzer 10 then identifies the stem which is to be used for synthesis. If the word is associated with an abstract stem, that abstract stem will be used to generate related words. Next, morphological analyzer 10 searches the continuation class tree associated with the stem to determine what suffixes can be added to it. If the output is to be constrained by the lexical category of the input or output (i.e., suffix rating or part of speech), morphological analyzer 10 checks lexicon 12 to see if the constraints are valid with the given stem. If the output is to be constrained by the addition of a specific suffix, morphological analyzer 10 gets all equivalent suffixes (which are defined and described in greater detail later) from a table of equivalent continuation classes and loosens the constraint to include all suffixes equivalent to the one specified. Morphological analyzer 10 returns all surface strings that can be synthesized from the lexical input and that meet any restrictions specified with the call to generation.
Synthesizing a surface string consists of pairing the characters and symbols in the lexical representation with allowed surface characters in such a way that the conditions specified by the spelling rules are met. To do this, the following procedure is used on each parse of the input word. For each character in the string, morphological analyzer 10 determines all possible interpretations of that character, i.e., all the surface characters that could be paired with it. For each interpretation, analyzer 10 moves the automaton to a new state and keeps those interpretations that do not cause the automaton to block. At specific points, analyzer 10 checks for all possible pairings of a lexical null with a surface character, i.e., it checks for possible epenthesis. Analyzer 10 retains those pairings for which the automaton does not block. If there are no more characters in the input string, and if moving the automaton over the end-of-word marker puts it in a final state, then an acceptable word has been synthesized.
The Recognition Algorithm
Before exploring the details of how the recognition and generation functions operate, an overview of the operation of morphological analyzer 10 will be given using recognition as an example. There are two primary tasks involved in the recognition algorithm (i.e., WFSrecognize); they are: manipulating the automaton and traversing the lexicon. At each step, the recognition function moves the automaton and checks the current states. It then moves down the lexicon tree, checking for entries and, in the suffix lexicon, the current continuation classes.
Manipulating the Automaton
The automaton is built for recognition and generation, accepting surface characters and producing underlying ones. The machine moves from one state to another on pairs of surface and lexical characters and the new state reflects the changes in the automaton as a result of accepting a particular pair of characters. As each of the characters of an input word is processed, all the possible lexical interpretations of an input character are checked. The only combinations of surface and lexical characters that can be used are those that are allowed by the automaton. As a result, (e, s) will not be accepted as a valid pair if there is no charge from a lexical `s` to a surface `e` in the automaton.
Also, before each character, a guess is made that a null character (a character deleted on the surface of the word) needs to be inserted corresponding to a lexical character in the output. Note that all possible interpretations of a character or null character are accepted. Checking, as far as the automaton is concerned, means determining that the automaton does not block after an interpretation is accepted. If the automaton does not block, this is a valid configuration of the recognition machine. When the end of an input word is reached, the recognition machine must also be checked to see that it is in a final configuration. A final configuration is one in which the automaton is in a final state, meaning that all the conditions for any spelling changes that occurred have been met and that any spelling shanges required by context were made.
Traversing the Lexicon
As mentioned above, there is a stem lexicon 16 and a suffix lexicon 20. Two functions are used to manipulate these lexicons, namely, Lookup.sub.-- stems and Lookup.sub.-- suffixes. Lookup.sub.-- stems searches for each possible stem that occurs in the input word using stem lexicon 16, and Lookup.sub.-- suffixes tries to identify the suffixes that have been added to a stem using suffix lexicon 20 and the continuation class tree accessed from the end of a stem. Each subroutine is a recursive function that is called on every valid interpretation of a character in the input word. Using these lexical interpretations, the lexicon (either the stem or suffix lexicon) is traversed character by character until either a valid parse is found or failure is returned.
There are only four possible situations which may occur at any point when traversing either lexicon 18 or 20 and each situation is dealt with in a similar manner in both lexicons. One of the main differences between the two lexicons, however, is the existence of continuation classes and continuation class vectors in suffix lexicon 20. These add extra checks when traversing suffix lexicon 20. Therefore, Lookup.sub.-- suffixes has one more argument than Lookup.sub.-- stems and that is the list of continuation classes that are being searched out for the current call to Lookup.sub.-- suffixes. The four possible situations and the corresponding actions will now be described.
In one situation, there are no more characters left in the input word and there is no end to a stem or valid suffix at the current location in the lexicon. This means the current branch of the parse tree is unsuccessful, so failure should be returned. A valid suffix is one that belongs to a class in a continuation class list which is passed to Lookup.sub.-- suffixes. Even if there are classes at this point in the suffix lexicon, if none match a class in the list, then failure is returned.
In the second situation, there are no more characters in the input word and there is an end to a stem or valid suffix at the current location in the lexicon. At this point, if the automaton is in a final state, then a valid stem or suffix has been found. The information found at this point in the lexicon is returned as a node in the parse tree of the input word.
In the third situation, there are more characters in the input word, but the end of a stem or valid suffix has been found at the current location in the lexicon. At this point, there is a split in the parsing of the input word. If the algorithm is searching for stems, then, to allow for ambiguity and more than one possible stem, the algorithm will go on looking for more stems in the rest of the word and combine the results with the stem found at this point. In the suffix lexicon, any suffixes found must have children in the continuation class tree in order to be considered. Since there are more characters in the input word, they must be used up by children suffixes. Any continuation classes in the list that end here and do not have children are discarded. For any that do have children, Lookup.sub.-- suffixes is called with the class numbers of the children. If this call succeeds, then this suffix is part of the input word and is combined with the information returned. Also, another call is made to Lookup.sub.-- suffixes with any continuation classes whose suffixes are only partially completed at this point. Any parse found down this path is combined with the parses for the finished suffixes and returned.
Finally, in the last situation, there are more characters in the input word and there is no end of a stem or valid suffix (but there are still continuation classes that can be found below this point). In either stem lexicon 16 or the suffix lexicon 20, the program simply continues searching with the next character in the input word. Any parses that are found are simply passed back. In suffix lexicon 20, there may be some continuation classes which have no suffixes down the current branch of suffix lexicon 20. These are discarded from the list of classes being searched for.
Special Considerations
Some of the special considerations and issues that are involved in the recognition and generation algorithms will now be discussed. This includes such things as how backtracking is accomplished, how ambiguities are processed, how abstract stems are handled, and other issues. In the following paragraphs and in the pseudo-code outlines described later, certain terms appear in angle brackets such as <BAS> or <SUF>. These terms are labels denoting specific fields in the entries of continuation class lexicon 18.
The <SUF> label indicates which derivational suffix ends the form. It must be used for irregular derivational entries but cannot be used for irregular inflections.
WFSrecognize uses recursion, modeling a decision tree in its processing. Each of the characters in the input word that is passed to the function presents a possible decision point. The choices at each character include different interpretations of that character, the guess of a null character being inserted, and the different possible continuation classes for a suffix. At a decision point, each possible choice forms a branch from that point and each branch is explored successively. Those that eventually return information are merged and the results are passed back up the tree. If a branch should fail along the way, that is passed back to be discarded by the next higher level as being an invalid branch. If all the possible branches are invalid, then the decision point fails and failure is returned. This type of processing takes care of both backtracking and the case of multiple interpretations, i.e. ambiguity. Since all branches are explored at any level and those that fail are simply rejected, there is no need to backtrack. Also, all possible interpretations at a given decision point are returned, not simply the best choice. As a result, more than one answer can be returned. For example, the word `leaves` will be recognized as both the 3rd-person present form of the verb `leave` and the plural form of the noun `leaf`.
Irregular words are stored as stems in stem lexicon 16 with cross-references to their base forms found inside the corresponding entries for the irregular stem in continuation class lexicon 18. In continuation class lexicon 18, the base form is placed in a field with a <BAS> label. For example, `forgive` is stored as the base form, <BAS>, of `forgave` and `forgiven`. There can be more than one possible base form and, therefore, more than one <BAS> field. The word `more` has three <BAS> fields, one for `much` as an adjective, one for `much` as an adverb, and one for `many`. When the recognition algorithm builds the base history of the input word, it uses the <BAS> fields as the bases of an irregular stem and then builds the rest of the base history with the suffixes that were stripped from the irregular word.
When recognizing or generating an irregular form, it is sometimes useful to know the suffix that was "added" to the base to produce the irregular form. For example, if a thesaurus application took the `ed` suffix from a word such as `donated` and tried to attach it to `give` to generate the same tense, rather than receiving an error it should receive the words `gave` and `given`. In reverse, if trying to retrieve the suffix of `gave` in order to place it on another regular verb, it should be able to retrieve the implied `ed` suffix. In the case of inflectional endings, this implied suffix can be determined from the tense of the irregular form as in the example above: the past tense of a verb is formed by an `ed` suffix. In the case of derivational suffixes, however, such a mapping from part of speech to suffix is not one to one. There are several suffixes which form nouns from verbs, for example, and it cannot be determined from the part of speech which it was. Therefore, a field with a <SUF> label is used to store the suffix that was "added" to create the irregular form, if such a suffix exists. If there is no such suffix, as in the relationship between `succeed` and `success` then the <SUF> field will be empty, and no suffix will be returned.
As mentioned above, there can be more than one <BAS> field associated with a stem form. Also, a stem can be its own base. For example, the word `lay` can be the past tense form of to `lie` or a tenseless verb in itself, or a noun. If it is the past tense form, its base is `lie`. If it is a tenseless verb or a noun, it is its own base. The following is a listing of the base history for the word `lay, which shows how the different base forms should map to the different parts of speech.
______________________________________
History 0: History 1:
Record 0: Record 1: Record 0:
{Word:
lie, {Word: lay, {Word:
lay,
POS: v., POS: v. past,
POS: v.,
Base: Record 0, Base: Record 0,
Base: Record 0,
Suffix:
+O, Suffix: +ed, Suffix:
+O,
Rating:
O} Rating: 1} Rating:
0}
History 2:
Record 0:
{Word:
lay,
POS: n.,
Base: Record 0,
Suffix:
+O,
Rating:
0},
______________________________________
When processing such a word that is ambiguous and whose definitions have different bases, it is necessary to know which base goes with which definition. In order to perform the mapping from base to definition (i.e., part of speech or level 1 continuation class), another type of field with a <MAP> label is included within the <BAS> field. The different definitions of a stem are reflected in the top level classes of its continuation class tree. Whenever there is a need to know which base goes with which definition of a stem, for each <BAS> field, there is a set of <MAP) fields containing the top level classes to which their <BAS> field belongs.
The <MAP> field provides a map between a cross-reference and a level-1 node (level-2 node for abstract entires). The value of the <MAP> field is a continuation class to which the cross-reference corresponds. Note that formatives are valid values in the <MAP> field but they occur only for the abstract stem entries (see below). Entries that have more than one level-1 node and at least one cross-reference (i.e., <BAS>, <IRR>, or <REL>) require <MAP> fields.
Sometimes an entry that requires a <MAP> value for some of its level-1 nodes has nodes that are their own bases. For example, in the entry for `found` three nodes on level-1 (i.e., IPAST, IPPT, and ADJP) require <MAP> to `find` but one node (i.e., VERB) `found` is its own base and does not require a <MAP> to `find`. These nodes are referred to as implicit bases.
There is another kind of stem that requires special consideration. These are the abstract stems discussed earlier. Because abstract stems are not valid words, they cannot appear by themselves in the output of the recognition function but must be replaced either by the contents of <BAS>fields or by a word formed by combining the abstract stem with a root suffix. Level 1 continuation classes are those that contain no suffix, but assign part of speech information to the stem. A special level 1 class is the ABS class which denotes an abstract stem. This class has no suffix and no part of speech information. If there are no <BAS> fields associated with the abstract stem, then the suffix in the first second level class under the ABS class is used to build the root form for all the words based on this abstract stem. Also, the part of speech given by this suffix becomes the part of speech of the root form. During processing, the ABS class is filled in with this suffix and part of speech so that they are available when building the base history of any words based on that abstract stem. If there are <BAS> fields associated with the abstract stem, then the ABS class is not changed but continues with no part of speech or suffix. It points to the other suffixes that have been attached. FIGS. 21a-c give examples of the data structures formed for words with abstract stems, namely, `satisfy` `satisfaction` and `revolution`.
For abstract stem entries implicit and explicit bases never co-exist. If there is no <BAS> field, the first node appearing on level-2 by convention is considered to be the base (called the implicit base). Abstract entries that have more than one level-2 node and more than one explicit base (i.e., <BAS>) require <MAP> in the cross-references (<BAS>, <IRR>, or <REL>) if cross-references do not relate to all nodes at level-2.
As noted earlier, the gemination mark is stored with the other information associated with a stem form. This is a flag that generally denotes that the last syllable of the stem form is stressed, and that the final consonant might be doubled before certain suffixes, as in the word `begged`. There are cases, such as in compound words, where the final consonant should be doubled even though that syllable is not stressed. The gemination flag will appear in these cases as well. There are a very few words that meet the conditions for gemination but should not be geminated, as in `buses`. The gemination flag will be FALSE for these words. If the recognition algorithm finds that the last syllable is stressed, it can proceed as though it has just seen an underlying accent mark with a null character on the surface. The function will move the automaton and the lexicon as though this had taken place and continue with the evaluation. Gemination information has to be inserted before any continuation classes are investigated so that the automaton will take it into account when processing the suffix.
There are two types of ending morphemes that are not handled the same as most suffixes. The first type is the formative ending. It is given a rating of 4, but this rating, and the formative, are never seen by the user application. They are solely for internal purposes. A formative ending is a morpheme that is used as a building block in forming a suffix and does not actually form a new word when added all by itself. For example, `academically` is a valid word, but `academical` is not. The `al` morpheme, although a valid ending in other words, only helps in building the ending `ally` made up of `al` and `ly`. When a formative ending is encountered in the recognition or generation function, it is noted and a base is not generated that ends with that formative ending. If no further endings are given, then that is an invalid parse and failure is returned.
The other type of ending that is handled differently than other suffixes is a suffix with a rating of 3. This rating signifies that when this type of suffix is attached to a word, the resulting word is only "distantly related" to the original form. In such a case, in recognition, the suffix will not be stripped because the resulting word is considered to be a root form in and of itself even though related to the word without the suffix attached. In generation, however, when all the related forms of a stem are desired, these related words will be included with the other words returned.
When stripping a suffix off of one word and adding it to another, another complication arises. Although there is a form of the other word with this suffix, the form of the suffix may be slightly different when used with the new word. For example, some words end in `tion`; whereas, others end in `ation` or just `ion` depending on their structure. The user application of these functions may not know which form of a suffix to use in the different cases. Therefore, when the continuation class of a suffix does not match those belonging to a particular stem, a table is consulted that lists the equivalent suffixes to the one given. The equivalent suffixes are then compared to the continuation class tree to see if there is a valid equivalent for the new word. Appendix C contains a table of suffixes, the continuation classes that they belong to, and the continuation classes that could be substituted for them, i.e., the equivalent classes.
Generally, a continuation class is used to represent a specific suffix. For example, the continuation class ERY refers to the suffix `ery`. Special subsets of these specific continuation classes have been defined for every major part of speech (like noun or verb) to contain the regular endings of those parts of speech. For example, the continuation class set VREG, the set defined for regular verbs, contains the continuation classes for the present third person ending `#S`, the past tense ending `#ed` the past participle ending `#ed`, and the progressive ending `#ing`. These sets each have their own symbol that looks just like other continuation classes in the continuation class structure associated with a stem in the lexicon. When processing these symbols in the recognition process, however, they are seen as macro classes and they are replaced with the subtree of classes they actually represent. There are three examples of such macro classes, namely, VREG, NREG, and ADJREG.
Some very common abbreviations are cross-referenced to the full forms they are derived from. For example, "Dr. Jan. Mon., Mass., lb., gr." are cross-referenced to "Doctor, January, Monday, Massachusetts, pound, gram" through the use of the <IRR> field. In this case, the <IRR> is not a bi-directional link, because the full forms are not cross-referenced to the abbrviations. The expnaded for of the abbregiation is given a REL rating to indicate that the expanded form and the abbreviation are interchangeable.
During the processing of a word, guesses must be made after each character to determine whether or not a null character should be inserted at that point. For example, when generating the word `begged` from `beg#ed` an extra `g` is inserted on the surface which does not appear in the lexical form. This inserted `g` corresponds to a lexical null character. A guess consists of inserting a null character in the input, pairing it with every possible character it could be interpreted as, and for each such pair, continuing to parse until failure or success is reached. In other words, these null/real character pairs introduce several new branches of the parse tree to be explored for every character in the input word. Exploring all these branches can consume a good deal of time, and most of them are useless.
In recognition, there are 8 lexical characters that can be paired with a surface null and each of these 8 characters must be tried. (See Appendix D for a list of all possible pairings that are permitted.) If one is allowed to be inserted by the rules, then another 8 guesses must be made to see if another one is to be inserted, and so on. In generation, there are 17 pairings of a surface character to a lexical null, and thus, 17 guesses to be made between characters and then another 17 if one succeeds, etc. If no other constraints are put into the system, this sort of guessing can grow exponentially and be very costly to sort out.
There is contextual information, however, not available to the automaton when this processing is going on that can be used by the morphological analyzer to help control this guessing. For instance, a null constraint function can be used to indicate to the morphological analyzer that a lexical boundary symbol paired with a surface null will only occur between morphemes and not between each character and thus it need only try those types of pairs when the end of a stem or suffix is found. Such a null constraint criterion will reduce the number of guesses between characters by two. The different constraints that the null constraint function imposes on the morphological analyzer for null characters are given below.
(1) In between the stem and suffixes: In recognition, this is a very special place where only certain things can happen. This is the only place where a stress mark paired with a surface null will be guessed at because this is the only location where stress matters. After the optional stress mark, three scenarios could take place, namely, a suffix may follow, one character may have been inserted before the suffix (on the surface) or two characters may have been inserted. Thus, each suffix boundary is guessed at followed by an analysis of the rest of the characters in the input string. Then, a lexical null character paired with the next input character is "guessed" and if this is allowed by the rules, each boundary character is again tried followed by the rest of the input string. Finally, a second lexical null is tried and if this is allowed by the automaton, each boundary character is tried a third time followed by the remaining characters in the input string.
(2) In between suffixes: When one suffix has finished and another is about to be started, this is the only other place besides after the stem where a boundary character should be tried. Boundary characters are not "guessed" at any other places.
(3) In between each character: In between each character all interpretations of a null character (except those mentioned in preceding paragraphs (1) and (2) for recognition, the stress mark and the boundary characters) can be "guessed" to be inserted. The number of guesses between each character for recognition is reduced to 5 because of the constraints in (1) and (2). In generation, the special characters of (1) and (2) are already present in the input string and are not guessed at. As a result, the number of guesses for generation is not reduced by (1) and (2). However, another constraint is used for both recognition and generation. By looking at the rules in their entirety, it can be determined before which characters a null character will be allowed. With this information, instead of trying a null character and then trying the next one and then having the automaton block, the next character can be used to determine if a null should be "guessed". The set of characters that allow nulls is different depending on whether recognition or generation is being done. In the described embodiment, there are two functions that are called to quickly check if the next character allows a null to be inserted: rec.sub.-- nullcheck and gen.sub.-- nullcheck. The characters that allow this are as follows:
Recognition: t,l,y,a,e,i,o,p
Generation: +,#,1
The Internal Routines of the Morphological Analyzer
The internal routines which perform the recognition and generation functions of the morphological analyzer will now be described. As noted above, the recognition function is performed by WFSrecognize and the generation function is performed by WFSgenerate. WFSrecognize calls two functions which manipulate the lexicons, namely, Lookup.sub.-- stems and Lookup.sub.-- suffixes, to parse the input word and to generate a corresponding parse tree that identifies all of the successful parses of an input word. WFSrecognize then calls two other routines, namely, Make.sub.-- base.sub.-- history and Base.sub.-- hist.sub.-- loop to produce a base history for the resulting parse tree.
The WFSrecognize Function
WFSrecognize takes the surface form of a word and returns a history identifying the different stems found within that word along with the suffixes required to generate a successful parse of the complete word for each of the stems. Using the surface form of the word passed, WFSrecognize (by calling another function referred to as Lookup.sub.-- stems) searches for stems within the word by beginning at the first character and moving to the end of the word one character at a time. After the stems are identified, WFSrecognize (by calling another function referred to as Lookup.sub.-- suffixes) identifies all suffixes that are attached to those stems and which result in a successful parse of the word. The recognition process yields a parse tree identifying all of the successful parses of the word. WFSrecognize converts the resulting parse tree into a history which is passed back to the user or the application which called WFSrecognize.
Two parameters must be passed to WFSrecognize when it is called, namely input.sub.-- word and WFS.sub.-- info. Input.sub.-- word is a text string for which a parse tree history is desired. Since WFSrecognize only operates on a single word at a time, input.sub.-- word must have no spaces within it. WFS.sub.-- info is a parameter containing pointers to a number of data structures, including a pointer to the automaton, a pointer to the stem lexicon (stem.sub.-- lexicon), a pointer to the suffix lexicon (suf.sub.-- lexicon) and a pointer to the continuation class lexicon (cc.sub.-- lexicon).
Referring to FIGS. 22a-g, when first called, WFSrecognize determines whether input.sub.-- word satisfies this single word requirement by searching for spaces within input.sub.-- word (step 500). If a space is found, WFSrecognize returns an error to the user indicating that it has received a phrase (step 502). If there are no spaces found within input.sub.-- word, WFSrecognize scans it for illegal characters (step 504). If any are found, it notifies the user that input.sub.-- word is an illegal word and then terminates further processing of that word (step 506). Once it is determined that input.sub.-- word is a valid word, WFSrecognize checks whether it is an abbreviation (step 508). To do this, WFSrecognize checks the word form for the presence of capitalization at the beginning and/or a period at the end of the word. If these indicators of an abbreviated word are found, WFSrecognize looks input.sub.-- word up in the stem lexicon to find its entry point in the continuation class lexicon. The entry point identifies a data structure within the continuation class lexicon which contains a flag indicating whether the word is an abbreviation. If input.sub.-- word is an abbreviation, the words for which it is an abbreviation are also identified through a cross reference. For each of the cross references identified for the abbreviation, WFSrecognize generates a corresponding base record and returns control to the program which called it (step 510).
If input.sub.-- word is not an abbreviation, WFSrecognize initializes the stem lexicon in preparation for a search for the stems of input.sub.-- word (step 512). Since the entire stem lexicon is too large to have the entire contents of the file in memory at the |