Method for verifying spelling of compound words4777617Abstract This invention describes a method for automatically verifying spelling of compound words in many natural languages such as German, Danish, Swedish, Norwegian, Dutch, Icelandic, Afrikaans, Swiss German, etc. The basic technology of looking up words in a dictionary is supplemented by the association of component flags with each word and by the application of powerful tree-scanning techniques that isolate the components of compound words and determine their correctness in isolation and in association with each other. The technique can be used in word processing systems to support spelling verification, to hyphenate text, and to unhyphenate text. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE I
__________________________________________________________________________
TREE SCANNING PROCEDURE
FOR PARSING COMPOUND WORDS
__________________________________________________________________________
/* GENERAL PARAMETERS */
DCL
WORD CHAR(MXLENGTH);/* Word to be decompounded
*/
DCL
MXLENGTH FIXED BIN(15); /* Maximum word length
*/
DCL
WORDLEN FIXED BIN(15); /* Length of WORD
*/
DCL
SWORD CHAR(MXLENGTH);/* Substring of word
*/
/* STACK FOR BACKTRACKING */
DCL
1 STACK,
2 TABNOMAX FIXED BIN(15), /* Maximum no. of components
*/
2 TAGMAX FIXED BIN(15), /* Maximum no. of substrings
*/r
/* a component level */
2 TABNO FIXED BIN(15), /* No. of current component
*/vel
2 TABLE(TABNOMAX)
3 POS FIXED BIN(15), /* Starting position of substring
*/
3 NUM FIXED BIN(15), /* No. of substrings for
*/rrent
/* level */
3 TAG FIXED BIN(15), /* Current substring number
*/
3 LEN(TAGMAX)
FIXED BIN(15), /* Length of substrings in
*/rd
/*
PSEUDOCODE FOR DECOMPOUNDING */
/* */
/* Set up first table entry (first level */
TABNO = 1;
TAG(TABNO) = 1;
POS(TABNO) = 1;
/* Point to first character of WORD
*/
SWORD = WORD; /* Set SWORD equal to WORD */
FORWARD:
/* Find the leading substrings of SWORD that exist in the
*/
/* dictionaries. Store number of substrings found and the
*/
/* length of each substring in the stack.
*/
CALL FIND --SUBSTRINGS(SWORD,STACK);
/* If no substring is found, backup to the previous level
*/
IF NUM(TABNO) = 0 THEN GOTO BACKWARD; */
FLAG --CHECK:
/* Examine the compound flag of the substring for validity
*/
CALL FLAGCHECK(WORD,WORDLEN,STACK,FLAGRC);
/* If the flag is valid (FLAGRC = 0), then check if the
*/
/* substring exhausts the word. If it does, we have found
*/
/* a valid compound word. If it does not then go to the
*/xt
/* level and finding the next subword to be decompounded.
*/
IF FLAGRC = 0 THEN DO;
IF POS(TABNO) + LEN(TABNO,TAG) - 1 = WORDLEN
THEN GOTO SUCCESS; /* Complete word OK */
POS(TABNO+1) = POS(TABNO) + /* Start of new substring
*/
LEN(TABNO,TAG);
TABNO = TABNO +1; /* Increment level */
/* SWORD contains remaining substring of the word.
*/
SWORD = SUBSTR(WORD,POS(TABNO),WORDLEN)-POS(TABNO));
TAG(TABNO) = 1; /* Set TAG=1 */
GOTO FORWARD; /* Try new candidate */
END;
/* If the flag is not valid (FLAGRC=1), then if there is
*/
/* another substring at this level, examine the flag for
*/
/* this substring. Otherwise, go back to the previous level
*/
/* after checking morphological possibilities.
*/
IF TAG(TABNO) < NUM(TABNO) THEN DO; /* More substr.this
*/vel?
TAG(TABNO) = TAG(TABNO) + 1; /* Next substr.this level
*/
GOTO FLAG --CHECK; /* Check its flag */
END;
/* No more substrings at this level */
/* Apply language-specific morphological procedures.
*/
CALL MORPH(WORD,WORDLEN,STACK);
BACKWARD:
/* Calculate the previous level. If the resulting level
*/
/* zero, we have failed to find a valid word
*/
TABNO = TABNO -1;
IF TABNO = 0 THEN GOTO FAIL;
/* If there is another substring at this level, try this
*/
/* substring. Otherwise, go back to the previous level.
*/
IF TAG(TABNO) < NUM(TABNO) THEN DO;
TAG(TABNO) = TAG(TABNO) + 1;
GOTO FORWARD:
END;
GOTO BACKWARD;
/* END OF TREE SCANNING PROCEDURE */
__________________________________________________________________________
Letter Elisions There are circumstances when letters are elided for linguistic reasons when two words are combined and the resulting compound word is not strictly a juxtaposition of the components. The invention is able to process these cases for certain general cases. In German, for example, a letter is elided when a word ending in a double consonant is combined with another word that starts with the same consonant. For example, the word "Schiffahrt" (German for "boat trip") is constructed from "Schiff" ("boat") and "Fahrt" ("trip") by eliding an "f." Therefore, as part of the morphological transformations performed by the invention, when no suitable dictionary entries can be found for the remainder of the word (in this case for "ahrt"), a check is made to see if the previous component ends in a double consonant and the next character a vowel. When these conditions are found, the starting character of the remaining string is set to last character of the previous word. In this way the program checks for the word "fahrt." Letter Insertions A case analogous to letter elisions is the case of letter insertions. In German this happens with the genitive case where the letter "s" is added to the initial component. The invention handles this problem by checking to see if the remaining component starts with "s" whenever no suitable match can be found. If an "s" is present, it is skipped and a match is attempted on the remaining portion of the compound word. Capitalization Just as English requires proper nouns and adjectives derived from them to be capitalized (e.g., American), so German grammar requires that all nouns be capitalized. Thus, a German dictionary contains many words starting with upper case. This presents a problem when matching a compound word, since only its first character may be capitalized and any internal components will start in lower case. This is solved by the invention by ordering the dictionary in alphabetical order without regard to case (but indicating case as an additional attribute) and by converting the dictionary words and the compound word to a common character set at the time of matching. Specific Morphological Transformation In addition to the features that characterize the usage of a component in a compound word, codes specifying specific transformations can be associated with each word in the dictionary. For example, the Afrikaans language has a large number of words for which the combining non-back form of the genitive singular has the same spelling as the nominative plural, both adding "s" to the nominative singular. When such a word is compounded with a word starting with "s," one of them must be elided. This can be indicated by associating a special code with such words. For example, such a code associated with the word "mans" would result in elision of an "s" when combined with "skool:" mans+skool=manskool. Such specific transformations are represented by codes in the dictionary and enable the decomposition of the compound word into its components. Notice that this technique relies on the specific encoding of the attribute within the dictionary rather than as a part of the algorithm as indicated for elisions above. Support of Hyphenation One of the by-products of compound word analysis as described in this invention is the state of the stack after a compound has been processed. The stack contains the information that identifies the components of the compound, and hence, the major hyphenation points of the compound. These hyphenation points may be supplemented by hyphenation information stored for each component. In addition, the fact that the invention identifies letter elisions makes it possible to hyphenate correctly such words. For example, the German word "Schiffahrt" would be correctly hyphenated as "Schiff-fahrt" by restoring the elided letter. The flow diagram of FIG. 3 is implemented with a stack. Step 20 says to get the text word. This is a either text from an input buffer or it can be supplied by another computer program. In Step 22, we clear the stack. That means we set the stack to empty and we set the level equal to one. The procedure by which we scan the stack is basically a pre-order form and that means that you set a chain as in FIG. 4, starting with the node "a" which splits off into nodes "b" and "c." The nodes "b" and "c" are at the level 2 and then if "b" splits into nodes "d," "e" and "f," this is the third level. If node "c" splits into nodes "g" and "h," again this is the third level. The tree would be scanned by going from "a" to "b" to "d," then "e," then "f," back to "b," back to "a" and then to "c" to "g" and then "h," back to "c" and back to "a." That would terminate the scan of the tree. The disclosed invention does the equivalent function for trying the various possibilities of finding substrings within a word for verification. We now proceed to the label "Forward." At this point, we get the text word in Step 24; in this case, there is no change in the text word because it has not been altered. It is the same as that which was originally input. In subsequent looping through the flow chart, it would be a modified version of the word. Then we proceed to Step 26 labeled "Check dictionary for matching substrings." At this stage, we check the dictionary and any dictionary entries which match exactly against the initial portion of the string are identified. If none of the words match, we go to the label "Backward" where we are going to decrement the levels and try other possibilities by going to other branches of the chain. But if there are some matching entries, then this creates a new level, so we increment the level in Step 28. In order to account for upper case, we have to check if the level is greater than one in Step 30. That means it is not an initial substring of the word, and we disregard the upper case that may be in the word stored in the dictionary. At this point, we come to Step 32 labeled "Store number of substrings at current level in their length in stack." Basically this is just a mechanism for storing the strings for future reference. It is storing the place where the string starts and its length for the current level. By having stored this information, we proceed to check the flag of the substrings in Step 32. If we find an unsuitable flag, then we go to Step 36 marked "More substrings at the current level." We continue checking those substrings until we either do not find any substrings at the current level or until we find a substring with a suitable flag. When we find the substring with the suitable flag, we proceed to Step 38 "Apply transformation specified by flag or code, if any." In this step, transformations requiring additions of letters are performed. After these transformations are applied, then we check to see if the substring is at the end of the word in Step 40. If it is, then we can say that the word verified because all the components have been checked and they have appropriate flags, so at this point, we exit at Step 42. However, if the substring is not at the end of the word, now we remove from the word that initial substring in Step 44 and keep the remainder (the right part) of the substring and then go back to the label "Go forward" to repeat the process with the remaining substring. In this way, each segment of the word is iteratively processed until the whole word has been recognized. In Step 46 labeled "Apply morphological transformations," whenever there are no more substrings at the current level, the operator must take into consideration whether there were any transformations applied between the words. At this point, those transformations are applied. If the transformation is successful, again we go to the label "Forward," continue processing the string. If we cannot transform it, then we go back to the label "Backward." "Backward" is really the process of recovering from a failure along the branch and if we can explore the branch of a tree which does not lead to a complete recognition of the word, then we decrement the level in Step 50 after we have tried all the possibilities at that level (Step 52). If the level is zero in Step 54, then we know that the word does not verify (Step 56) because we have tried all the possibilities. If the level is not zero and there are more substrings at the current level, we continue with "Forward" until we exhaust all the possibilities. Although a specific embodiment of the invention has been disclosed, it will be understood by those of skill in the art that changes can be made to the invention without departing from the spirit and the scope of the invention .
|
Same subclass Same class Consider this |
||||||||||
