Chinese word segmentation apparatus

6879951

Abstract

A Chinese word segmentation apparatus relates to processing of a Chinese sentence input to a computer. A character-to-phonetic converter of the segmentation apparatus initially converts a Chinese sentence into a phonetic symbol string while referring to a character phonetic dictionary and a dictionary for characters with different pronunciations. Thereafter, a candidate word-selector refers to a system dictionary to retrieve all of the possible candidate characters or words in the phonetic symbol string and relevant information, such as frequency of use, using the phonetic symbols as indexing terms. Unfeasible candidate characters or words are discarded. Subsequently, an optimum candidate character string-decider builds a candidate word network using starting and ending positions of each candidate character or word in the input sentence as indexing terms. By referring to semantic and syntax information portions, frequency of use prioritization, word length prioritization, semantic similarity prioritization and syntax prioritization are combined to obtain a total estimate, from which the optimum route for word segmentation is determined. A word segmentation marking portion then adds word segmentation markers into the input sentence while referring to the optimum route to complete word segmentation.

Claims

What is claimed is:

Description

BACKGROUND OF THE INVENTION
Semantic Code   Description
0               Nature Class
02              Weather Sub-class of the Nature Class
028             Wind Section of the Weather Sub-class
028a            Strength Sub-section of the Wind Section
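One way to read the table above is that each semantic code is a prefix of every code ranked beneath it. The following Python sketch illustrates that reading only; the function names `ancestors` and `covers` are hypothetical and do not come from the patent.

```python
def ancestors(code: str) -> list[str]:
    """List every code from the top-level class down to `code` itself.

    Assumes (as the table suggests) that each rank adds one symbol
    to the code of the rank above it.
    """
    return [code[:i] for i in range(1, len(code) + 1)]


def covers(broad: str, narrow: str) -> bool:
    """Return True if `broad` equals `narrow` or is one of its ancestors."""
    return narrow.startswith(broad)


print(ancestors("028a"))       # ['0', '02', '028', '028a']
print(covers("02", "028a"))    # True: Weather covers Wind Strength
print(covers("028", "02"))     # False: a Section does not cover its Sub-class
```

Under this reading, storing only "02" is enough to stand for everything in the Weather sub-class, since any narrower code can be tested against it with a prefix check.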
In the aforesaid subdivided-type classification code, the higher the rank of a semantic code, the broader the scope that it covers; conversely, the lower the rank, the narrower the scope. The semantic codes can thus be applied at whatever level the actual requirements call for. For example, to represent weather, only the code 02 needs to be used. There is no need to expand the code 02 to 021, 022, etc., which reduces the memory space required. Moreover, since these semantic codes are expressed in terms of numbers, they can be used in mathematical computation methods, such as set logic computations, to derive more information of value. For a detailed description of the semantic codes, one may refer to R.O.C. Patent Publication No. 161238, entitled "Machine Translator Apparatus," the entire disclosure of which is incorporated herein by reference.

In addition, according to R.O.C. Patent Publication No. 089476, entitled "Chinese Character Transforming Apparatus (II)," the entire disclosure of which is incorporated herein by reference, when converting a Chinese phonetic symbol string into a character string, the word length is an important factor to be considered. In this embodiment, word length prioritization is also one of the factors considered in word segmentation. The calculation thereof is as follows:

Word length prioritization = (Number of characters in candidate word - 1) * 2

For example, if the candidate word is "{character pullout}", which has three characters, the word length prioritization therefor is (3 - 1) * 2 = 4.

Furthermore, the preferred embodiment of this invention also involves syntax information as an enhancing factor in word segmentation. As shown in FIG. 9, the syntax information involves automatic learning of a marked large vocabulary database to refer to word categories, such as noun, adjective, verb, etc., of two words connected back-to-back in order to obtain a two-dimensional array. A value of 0 indicates that the two word categories cannot be placed beside each other, while a value of 1 indicates that the two word categories can be placed beside each other. The definition of syntax prioritization as a factor in word segmentation estimation is as follows:

Syntax prioritization = Syntax information value of (front-part word category, rear-part word category) * 5

In addition, the preferred embodiment of this invention also involves semantic information as an enhancing factor in word segmentation. As shown in FIG. 10, the semantic information also involves automatic learning of the marked large vocabulary database to obtain continuity semantic information. Since the semantic codes in use employ the subdivided-type format, calculation of the semantic similarity degree of back-to-back consecutive words can be done using set intersection computations. For example, the result of a set intersection computation for semantic codes "7140" and "714a" is "714". Since the result of the computation only includes three codes, the semantic similarity degree is deemed to be 3/4. Accordingly, if the result includes four codes, the semantic similarity degree is deemed to be 1. If the result includes only two codes, the semantic similarity degree is deemed to be 1/2. If the result includes only one code, the semantic similarity degree is deemed to be 1/4. If the result is a null set, the semantic similarity degree is deemed to be 0.

FIG. 1 illustrates a schematic system block diagram of the preferred embodiment of a Chinese word segmentation apparatus according to the present invention.
As shown in this figure, 250 denotes a dictionary for characters with different pronunciations that is used to store all of the characters in the Chinese language with different pronunciations, all of the character phonetic symbols corresponding to the characters with the different pronunciations, and all of the candidate words and word phonetic symbols corresponding to each of the character phonetic symbols. The dictionary 250 is shown in FIG. 6. 260 denotes a character phonetic dictionary that is used to store all of the characters in the Chinese language, the initial preset phonetic symbols corresponding to the characters, and other possible phonetic symbols for the characters. The character phonetic dictionary 260 is shown in FIG. 7. 350 denotes a system dictionary that is used to store phonetic symbols of Chinese characters or words, similarly sounding conflicting characters or similarly sounding conflicting words corresponding to each of the phonetic symbols, and frequency of use, syntax marker and semantic marker corresponding to each of the similarly sounding conflicting characters or similarly sounding conflicting words. The system dictionary 350 is shown in FIG. 8. 440 denotes a syntax information portion that is used to store a two-dimensional array formed from "1" or "0" bits to indicate whether or not different word categories can be connected in the Chinese language. The syntax information portion 440 is shown in FIG. 9. 450 denotes a semantic information portion that is used to store rear-part semantic code of Chinese words and possible front-part semantic code corresponding to the rear-part semantic code. The semantic information portion 450 is shown in FIG. 10. 100 denotes an input portion, such as a keyboard, for inputting a Chinese character string. 
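As a rough illustration of how the syntax information portion 440 and the semantic information portion 450 might feed the prioritization formulas given earlier, the following Python sketch computes the three factors. It rests on several assumptions not stated in the text: the category names and table values are made up, unlisted category pairs are treated as 0, and the "set intersection" of two semantic codes is interpreted as the intersection of their ancestor sets (every prefix of each code).

```python
CODE_LENGTH = 4  # full-depth semantic codes have four symbols, e.g. "7140"

# Hypothetical excerpt of the two-dimensional syntax array:
# (front-part category, rear-part category) -> 1 if adjacency is allowed.
SYNTAX_TABLE = {
    ("adjective", "noun"): 1,
    ("noun", "verb"): 1,
    ("particle", "particle"): 0,
}


def word_length_priority(word: str) -> int:
    """(number of characters in the candidate word - 1) * 2."""
    return (len(word) - 1) * 2


def syntax_priority(front: str, rear: str, table=SYNTAX_TABLE) -> int:
    """Syntax information value * 5; unlisted pairs count as 0 (assumption)."""
    return table.get((front, rear), 0) * 5


def semantic_similarity(code_a: str, code_b: str) -> float:
    """Size of the shared ancestor set divided by the full code length."""
    set_a = {code_a[:i] for i in range(1, len(code_a) + 1)}
    set_b = {code_b[:i] for i in range(1, len(code_b) + 1)}
    return len(set_a & set_b) / CODE_LENGTH


print(word_length_priority("abc"))          # 4
print(syntax_priority("noun", "verb"))      # 5
print(semantic_similarity("7140", "714a"))  # 0.75
```

The ancestor-set interpretation reproduces the 3/4 value the text gives for "7140" and "714a"; whether the apparatus computes the intersection this way is not stated.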
200 denotes a character-to-phonetic converting portion that refers to the dictionary 250 for characters with different pronunciations and to the character phonetic dictionary 260 in order to convert the Chinese character string inputted from the input portion 100 into a phonetic symbol string. 300 denotes a candidate word-selecting portion that is used to cut the phonetic symbol string obtained from the character-to-phonetic converting portion into syllables, to obtain all possible candidate words from the system dictionary 350 by using each of the syllables as an indexing term, and to discard unfeasible candidate words by referring to the inputted character string from the input portion 100. 400 denotes an optimum candidate character string-deciding portion that is used to interconnect the candidate words in the form of a directional network using starting and ending positions of each of the candidate words in the inputted character string from the input portion 100 as indexing terms, to calculate semantic similarity degree prioritization and syntax prioritization by referring to the syntax information portion 440 and the semantic information portion 450 while taking into account the syntax markers and the semantic markers of every two back-to-back candidate words, to obtain a total estimate that is a function of frequency of use prioritization, word length prioritization, syntax prioritization and semantic similarity degree prioritization, and to find a route for achieving an optimum estimate grade for word segmentation using a dynamic programming method. 500 denotes a word segmentation marking portion that is used to retrieve in sequence the candidate words in the optimum route and to add segmentation markers thereto. 600 denotes an output portion for outputting the marked character string. 700 denotes a buffer region formed from a memory device for providing temporary storage of the input character string and the intermediate processing results. FIG. 
2 illustrates the process flowchart of the character-to-phonetic converting portion 200. In step s201, the input Chinese character string from the input portion 100 is stored in the buffer region 700. In step s205, the input Chinese sentence is cut into syllables with reference to the character phonetic dictionary 260. In step s210, the phonetic symbols for syllabicated characters that do not have different pronunciations are generated with reference to the character phonetic dictionary 260. In step s215, the phonetic symbols for syllabicated characters that have different pronunciations are generated with reference to the dictionary 250 for characters with different pronunciations in a sequence from the tail end to the head end of the character string. In step s220, simple syntax rules are used to correct the phonetic symbols. For example, the phonetic symbols for the word "{character pullout}" after conversion are "{character pullout} . . . {character pullout} . . . ". However, the second syllable is actually read with a light sound. Thus, in this step, the phonetic symbols are corrected with reference to the syntax rules into "{character pullout}{character pullout}·". Processing ends after step s220.

FIG. 3 illustrates the process flowchart of the candidate word-selecting portion 300. In step s301, the phonetic symbol string transmitted from the character-to-phonetic converting portion 200 is cut into syllables with reference to the system dictionary 350. In step s305, the candidate words and the relevant semantic information, syntax information and frequency of use information are retrieved from the system dictionary 350 using each syllable of the phonetic symbol string as the indexing term. In step s310, the input character string is retrieved from the buffer region 700.
In step s315, with the characters and phonetic symbols of the candidate words as indexing terms, unfeasible candidate words are discarded using matching means while referring to the input character string and the phonetic symbol string. In step s320, the remaining possible candidate words and the relevant position information, semantic information, syntax information and frequency of use information are stored in the buffer region 700. Processing is subsequently terminated.

FIG. 4 illustrates the process flowchart of the optimum candidate character string-deciding portion 400. In step s401, the possible candidate words and the relevant information are retrieved from the buffer region 700. In step s405, a directional network for the candidate words is constructed using the position information of each candidate word as an indexing term. For example, when the word tail end position information of a front candidate word is 4 (the fourth character in the input character string), and the word head end position information of a rear candidate word is 5 (the fifth character in the input character string), this indicates that the two candidate words can be connected. In step s410, the word length prioritization, the syntax prioritization, and the semantic similarity degree prioritization are calculated. Thereafter, a total estimate that is a function of the frequency of use, the word length prioritization, the syntax prioritization and the semantic similarity degree prioritization is calculated. After a dynamic programming method is used to find the optimum route, the candidate words in the optimum route are sequentially obtained and outputted. Processing is subsequently terminated.

FIG. 5 illustrates the process flowchart of the word segmentation marking portion 500. In step s501, the optimum candidate word sequence (A) is transmitted from the optimum candidate character string-deciding portion 400. In step s505, the input character string (B) is retrieved from the buffer region 700.
In step s510, the sequence (A) and the sequence (B) are compared using matching means, and word segmentation markers are marked in the sequence (B). In step s515, the marked character string is outputted to the output portion 600. Processing is terminated at this time. In the example where "{character pullout}{character pullout}{character pullout}" is inputted using the input portion 100, the character-to-phonetic converting portion 200 of the Chinese word segmentation apparatus of this invention initially processes the same. First, the characters in the sentence that do not have different pronunciations are converted with reference to the character-to-phonetic dictionary 260 to obtain the result "ba3ta1 {character pullout} qyue4sh2 {character pullout} dong4zuo4 {character pullout} ian2jiou4". Thereafter, starting from the tail end to the head end of the sentence, it is found by referring to the dictionary 250 for characters with different pronunciations that the characters "{character pullout}" and "{character pullout}" do not form a corresponding word. Thus, the character "{character pullout}" is converted to the initial preset value "le0". By the same logic, with reference to the dictionary 250 while using the characters "{character pullout}" as an indexing term, it is determined that the pronunciation therefor is "xing2dong4". Thus, the character "{character pullout}" is converted to "xing2". Thereafter, while the characters "{character pullout}" have a corresponding candidate pronunciation in "di2qyue4," since the pronunciation of the characters "{character pullout} {character pullout}" is "de0qyue4sh2xing2dong4zuo4," the pronunciation "di2qyue4" of the characters "{character pullout}" will be abandoned, and the character "{character pullout}" will be converted to "de0" because of the longer word priority rule. 
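The tail-to-head lookup with longer-word priority described in this example can be sketched in Python roughly as follows. This is a simplified illustration, not the apparatus's actual procedure: the two dictionaries stand in for the character phonetic dictionary 260 and the dictionary 250, and every character, word and reading in them is hypothetical toy data chosen for the sketch.

```python
# Stand-in for dictionary 260: each character's initial preset reading.
DEFAULT_READING = {"銀": "yin2", "行": "xing2", "動": "dong4", "了": "le0"}

# Stand-in for dictionary 250: words that fix the reading of a
# character with different pronunciations (here, 行).
POLY_WORDS = {
    "銀行": ["yin2", "hang2"],
    "行動": ["xing2", "dong4"],
}


def to_phonetic(sentence: str) -> list[str]:
    """Assign preset readings, then override them from tail to head,
    trying the longest matching word first at each position."""
    readings = [DEFAULT_READING[ch] for ch in sentence]
    max_len = max(len(word) for word in POLY_WORDS)
    end = len(sentence)
    while end > 0:
        for length in range(min(max_len, end), 1, -1):
            word = sentence[end - length:end]
            if word in POLY_WORDS:
                readings[end - length:end] = POLY_WORDS[word]
                end -= length  # skip past the matched word
                break
        else:
            end -= 1  # no word ends here; keep the preset reading
    return readings


print(to_phonetic("銀行"))    # ['yin2', 'hang2']
print(to_phonetic("行動了"))  # ['xing2', 'dong4', 'le0']
```

Because the scan runs from the tail end, a character that could belong to two overlapping words is claimed by the word nearer the end of the sentence first, which mirrors the order of decisions in the example above.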
Thus, the result of the conversion from character string to phonetic symbol string is as follows:

"ba3ta1de0qyue4sh2xing2dong4zuo4le0ian2jiou4"

The conversion result, together with the input character string, is stored in the buffer region 700. Subsequently, the candidate word-selecting portion 300 operates according to the process flowchart of FIG. 3. By referring to the system dictionary 350, the phonetic symbol string is cut into all possible syllables as follows:

ba3-ta1-de0-qyue4-sh2-xing2-dong4-zuo4-le0-ian2-jiou4
ba3-ta1-de0-qyue4sh2-xing2-dong4-zuo4-le0-ian2-jiou4
ba3-ta1-de0-qyue4-sh2xing2-dong4-zuo4-le0-ian2-jiou4
ba3-ta1-de0-qyue4-sh2-xing2dong4-zuo4-le0-ian2-jiou4
ba3-ta1-de0-qyue4sh2-xing2dong4-zuo4-le0-ian2-jiou4
ba3-ta1-de0-qyue4sh2-xing2-dong4-zuo4-le0-ian2jiou4
ba3-ta1-de0-qyue4-sh2xing2-dong4-zuo4-le0-ian2jiou4
ba3-ta1-de0-qyue4-sh2-xing2dong4-zuo4-le0-ian2jiou4
ba3-ta1-de0-qyue4sh2-xing2dong4-zuo4-le0-ian2jiou4

Thereafter, with the use of the possible syllables of the phonetic symbols as indexing terms, the following exemplary possible candidate words are obtained with reference to the system dictionary 350:

ba3 ta1 de0 qyue4 sh2 xing2 dong4 zuo4 le0 ian2 jiou4
##STR1##
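The cutting of the phonetic symbol string into alternative syllable groupings can be sketched as a small recursive enumeration in Python. The multi-syllable keys below are the ones visible in the listing above; treating them as the complete set of multi-syllable index terms in the system dictionary 350 is an assumption, as is the function name `enumerate_cuts`.

```python
from typing import Iterator

SYLLABLES = "ba3 ta1 de0 qyue4 sh2 xing2 dong4 zuo4 le0 ian2 jiou4".split()

# Multi-syllable index terms assumed to exist in the system dictionary.
MULTI_SYLLABLE_KEYS = {"qyue4sh2", "sh2xing2", "xing2dong4", "ian2jiou4"}


def enumerate_cuts(syllables: list[str], keys: set[str]) -> Iterator[list[str]]:
    """Yield every grouping of consecutive syllables in which each group
    is either a single syllable or a known multi-syllable key."""
    if not syllables:
        yield []
        return
    for length in range(1, len(syllables) + 1):
        head = "".join(syllables[:length])
        if length == 1 or head in keys:
            for rest in enumerate_cuts(syllables[length:], keys):
                yield [head] + rest


for cut in enumerate_cuts(SYLLABLES, MULTI_SYLLABLE_KEYS):
    print("-".join(cut))
```

Each printed line has the same form as the entries in the listing, with hyphens separating the index terms that would be looked up in the dictionary.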
Subsequently, with reference to the input character string "{character pullout}{character pullout}{character pullout}" stored in the buffer region 700 and the corresponding position information, comparing means is employed to eliminate the candidate words different from the input character string. The possible candidate words are as follows: ba3 ta1 de0 qyue4 sh2 xing2 dong4 zuo4 le0 ian2 jiou4
##STR2##
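The elimination of candidate words that differ from the input character string can be illustrated with a short Python sketch. The dictionary contents and sample input are hypothetical toy data, one syllable per character is assumed, and positions are 0-based with an exclusive end, whereas the text counts characters from 1.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str
    start: int  # index of the word's first character in the input
    end: int    # index one past the word's last character


# Hypothetical excerpt of a system dictionary: phonetic key -> homophones.
SYSTEM_DICT = {
    "ta1": ["他", "她", "它"],
    "men0": ["們"],
    "ta1men0": ["他們", "她們", "它們"],
}


def feasible_candidates(sentence: str, syllables: list[str],
                        system_dict: dict[str, list[str]]) -> list[Candidate]:
    """Look up every run of syllables and keep only the homophones whose
    characters match the input string at the same position."""
    kept = []
    for start in range(len(syllables)):
        for end in range(start + 1, len(syllables) + 1):
            key = "".join(syllables[start:end])
            for word in system_dict.get(key, []):
                if sentence[start:end] == word:
                    kept.append(Candidate(word, start, end))
    return kept


for cand in feasible_candidates("他們", ["ta1", "men0"], SYSTEM_DICT):
    print(cand)
```

Recording `start` and `end` with each surviving candidate is what later allows two candidates to be linked whenever one ends where the other begins.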
Thereafter, relevant information, such as the semantic information, syntax information, frequency of use information, etc., from the system dictionary 350 and the position information for each of the candidate words are stored in the buffer region 700.

Then, the optimum candidate character string-deciding portion 400 retrieves the possible candidate words and the relevant information from the buffer region 700. Based on the position information of each candidate word (i.e. information as to whether or not candidate words can be placed back-to-back), a directional network is constructed as follows:

##STR3##

Next, the optimum candidate character string-deciding portion 400 calculates the word length prioritization, the syntax prioritization, and the semantic similarity degree prioritization. A total estimate that is a function of the frequency of use, the word length prioritization, the syntax prioritization and the semantic similarity degree prioritization is then calculated. Using a dynamic programming method, the optimum route sequence is found to be

##STR4##

Finally, the word segmentation marking portion 500 retrieves the input character string from the buffer region 700 and, based on the optimum character string sequence, inserts segmentation markers into the input character string as follows:

"{character pullout}*{character pullout}*{character pullout}*{character pullout}*{character pullout}*{character pullout}*{character pullout}*{character pullout}"

The marked character string is then provided to the output portion 600.

From the foregoing, it is apparent that the Chinese word segmentation apparatus of this invention can overcome the problems associated with the prior art. The effects of the present invention are as follows:

1. There is no need for a large vocabulary database, and a Chinese word segmentation accuracy of more than 98% can be achieved.
2. The possible candidate words can be reduced to a minimum to substantially increase the operating efficiency.
3. The apparatus can make use of existing Chinese character-to-phonetic conversion resources, such as computation means, system dictionary, etc., to achieve maximum results with less effort.
4. Not only can word segmentation be performed, but the problems associated with different word categories can also be overcome.

While the present invention has been described in connection with what is considered the most practical and preferred embodiment, it is understood that this invention is not limited to the disclosed embodiment but is intended to cover various arrangements included within the spirit and scope of the broadest interpretation so as to encompass all such modifications and equivalent arrangements.
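The route search over the directional network and the final marking step can be sketched in Python as below. This is a minimal illustration under stated assumptions: the text says only that the total estimate is "a function of" the four factors, so a plain sum is assumed here; the pairwise syntax and semantic terms are passed in as a hypothetical `pair_score` callable; and the sample words and frequency values are toy data.

```python
from typing import Callable, NamedTuple, Optional


class Candidate(NamedTuple):
    word: str
    start: int  # 0-based index of the first character
    end: int    # index one past the last character
    freq: int   # frequency-of-use prioritization (scale is hypothetical)


def best_route(cands: list[Candidate], length: int,
               pair_score: Callable[[Candidate, Candidate], float]) -> list[Candidate]:
    """Dynamic programming over the candidate network: for each candidate,
    keep the best-scoring route that ends with it."""
    order = sorted(cands, key=lambda c: (c.start, c.end))
    best: dict[Candidate, tuple[float, Optional[Candidate]]] = {}
    for cand in order:
        own = cand.freq + (len(cand.word) - 1) * 2  # frequency + word length
        if cand.start == 0:
            best[cand] = (own, None)
            continue
        # Predecessors are candidates ending exactly where this one starts.
        options = [(best[p][0] + pair_score(p, cand) + own, p)
                   for p in order if p.end == cand.start and p in best]
        if options:
            best[cand] = max(options, key=lambda o: o[0])
    finals = [c for c in best if c.end == length]
    if not finals:
        raise ValueError("no route covers the whole input string")
    last: Optional[Candidate] = max(finals, key=lambda c: best[c][0])
    route = []
    while last is not None:
        route.append(last)
        last = best[last][1]
    return route[::-1]


def mark(route: list[Candidate], marker: str = "*") -> str:
    """Join the words on the optimum route with segmentation markers."""
    return marker.join(c.word for c in route)


CANDS = [
    Candidate("研究", 0, 2, 3), Candidate("研究生", 0, 3, 2),
    Candidate("生", 2, 3, 1), Candidate("生命", 2, 4, 3),
    Candidate("命", 3, 4, 1),
]
print(mark(best_route(CANDS, 4, lambda front, rear: 0)))  # 研究*生命
```

A real `pair_score` would add the syntax prioritization and the semantic similarity degree prioritization for each pair of back-to-back candidates; returning 0 here simply leaves those terms out of the toy example.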
