Method for conversion of phonetic Chinese to character Chinese5270927Abstract A method for converting phonetic Chinese into Chinese characters effectuated by ascertaining an optimal path through a lattice of possible combinations of Chinese characters by calculating the probability of adjacent Chinese characters appearing in Chinese character text. Claims What is claimed is: Description FIELD OF THE INVENTION
______________________________________
tai2 wan1 you3 tai2 feng1
______________________________________
______________________________________
The circled Chinese characters represent the most likely path through the lattice, that is, the sentence in Chinese: "Taiwan has typhoons." I will now describe the methodology used to select the optimum path in the above-described example in accordance with the principles of my invention. Step 1 of FIG. 1 identifies the starting point of my method as the Chinese syllables S.sub.1 through S.sub.n or, in the illustrative example, S.sub.1 through S.sub.5 corresponding to the Chinese syllables "tai2", "wan1", "you3", "tai2" and "feng1", respectively. In accordance with step 2 of FIG. 2, Chinese characters C.sub.1 through C.sub.Z, representing all possible Chinese characters corresponding to a particular Chinese syllable, are generated for each Chinese syllable S.sub.i, where the numbers of such characters "z" will vary for each Chinese syllable. Thus, with reference to FIG. 2 which shows the matrix formed by the generation of such Chinese characters, "z" is 9 for "tai2" since "tai2" represents one of nine possible characters, "z" is 5 for "wan1" since "wan1" represents one of five possible characters, "z" is 6 for "you3" as shown, and "z" is 9 for "feng1" since, as shown, "feng1" represents one of nine possible characters. In accordance with step 3 of FIG. 1, the generated Chinese characters are formed into a matrix where the characters generated for the first syllable form the left hand column C.sub.1 and the characters generated for the last syllable (number n) form the right hand column C.sub.n. Thus, with reference to the matrix in FIG. 2, Chinese character C.sub.1 for "tai2" (which is really character C.sub.11 in the matrix following the numbering methodology of step 3 of FIG. 1) is the first Chinese symbol under "tai2", Chinese character C.sub.2 for "tai2" (which is character C.sub.12 in the matrix) is the second Chinese symbol under "tai2", Chinese character C.sub.3 for "tai2" (which is character C.sub.13 in the matrix) is the third Chinese symbol under "tai2", and so on until Chinese character C.sub.9 for "tai2" (which is character C.sub.19 in the matrix) is the ninth Chinese symbol under "tai2". Chinese character C.sub.1 for "wan1" is the first Chinese symbol under "wan1" and would be designated C.sub.21 in accordance with the numbering scheme of FIG. 1 since it is the first symbol for the second character. Similarly, Chinese character C.sub.2 for "wan1" is the second Chinese symbol under "wan1" and would be designated C.sub.22 in accordance with the numbering scheme of FIG. 1 since it is the second symbol for the second character. Finally, the last Chinese character C.sub.5 for "wan1" is the fifth Chinese symbol under "wan1" and would be designated C.sub.25 in accordance with the numbering scheme of FIG. 1 since it is the fifth symbol for the second character. Chinese character C.sub.1 for the last symbol "feng1" (i.e., S.sub.5) is the first Chinese symbol under "feng1" and would be designated C.sub.51 in accordance with the numbering scheme of FIG. 1 since it is the first symbol for the fifth character. The last symbol in the matrix found at its bottom right corner is C.sub.9 for "feng1" since it is the ninth symbol under "feng1" and is designated C.sub.59 since it is the ninth symbol for the fifth character. FIG. 3 depicts the same Chinese character matrix as that shown in FIG. 2, but, where for ease of explanation later of the selection of the optimum path through this matrix, some of the Chinese symbols have been replaced by the English designations "a" through "h". Thus, "a" in FIG. 3 represents the first symbol under "tai2" in FIG. 2, "b" in FIG. 3 represents the second symbol under "tai2", "c" in FIG. 3 represents the first symbol under "wan1" in FIG. 2, and so on. These designations are arbitrary (except identical Chinese characters obviously are identified by the same English designation) and serve only to facilitate further explanation of the principles of my invention. FIG. 4 shows a frequency matrix reflecting the use of adjacent Chinese characters (C.sub.i-1, C.sub.i) in a large corpus of Chinese text. This matrix was derived by analysis of the corpus and noting the number of times ordered pairs of Chinese characters appeared in the text. Again, for ease of explanation, the same English letters "a" through "h", used in FIG. 3, are again used respectively to identify the adjacent Chinese characters shown in FIG. 4. The letter "i" refers to the Chinese sentence delimiter (Chinese `period`), and as hereinafter explained, is used to delineate the beginning and the end of phrase of Chinese text. Since the number of Chinese characters is quite large, FIG. 4 shows only a representative portion of the actual frequency matrix which is approximately a 6000 by 6000 matrix and represents use of 6000 Chinese characters. FIG. 4 shows the number of times ordered pairs of Chinese characters (C.sub.i-1, C.sub.i) appeared in the corpus. For example, the pair "aa" appeared 5 times in the Chinese corpus since the number 5 appears for C.sub.i-1 =a, and C.sub.i =a. Similarly, the pair "ab" appeared 0 times since the number 0 appears for C.sub.i-1 =a, and C.sub.i =b. On the other hand, the pair "ac" appeared 1513 times (see, C.sub.i-1 =a, and C.sub.i =c). One notices immediately, that while the pair "ac" is found quite frequently in Chinese text, its inverse--namely "ca"--is found only infrequently (e.g., "1" in FIG. 4). One notices readily that many pairs were not found at all in the corpus--see, for example, "ab", "af", "ah", "ba", "bb", "bc" and so on. The letter "i" in FIG. 4 refers to the Chinese sentence delimiter and is used to pad both ends of an input sequence. Thus, where "a" is the first character, the ordered pair (C.sub.i-1, C.sub.i) really constitutes the pair "ia". Similarly, where "a" is the last character, the ordered pair (C.sub.i-1, C.sub.i) really constitutes the pair "ai". FIG. 4, therefore, indicates that "a" was the first character 322 times (see C.sub.i-1 =i, C.sub.i =a) and that "a" was the last character only 22 times (see C.sub.i-1 =a, C.sub.i =i). Returning now to FIG. 3, our goal is to find the optimum path through the 9 by 5 matrix (comprising C.sub.11 through C.sub.59) using the frequency information of FIG. 4 to thereby identify the five Chinese characters most likely corresponding to the five Chinese syllables "tai2 wan1 you3 tai2 feng1". More specifically, there are 21,870 possible paths through matrix (i.e., 9.times.5.times.6.times.9.times.9). One path would be that shown in the first row in FIG. 3--"aceag". Another path would be that shown in the second row--"bdfbh". Of course, by combining the first and second rows, other possible paths can be derived easily--e.g., "adfbh", "bdeag", etc. In fact, FIG. 5 shows the 32 combinations possible by combining just the symbols in the first and second rows in FIG. 3. The path "aceag" is listed as the first path in FIG. 5, "aceah" is listed as the second path, "acebg" the third . . . and "bdfbh". the last path. FIG. 5, also, shows to the right of each path the frequency calculation used to derive the optimum path. For example, the frequency calculation (or probability calculation) for the first path "aceag" is derived by looking up in FIG. 4 the individual ordered occurrence frequencies for the following pairs "ia", "ac", "ce", "ea", "ag", "gi"--namely, 322, 1513, 26, 25, 2, 41. Then these six numbers are multiplied together to derive the score for the use of the Chinese characters "aceag" resulting in the large number 25,967,013,800 shown in line 1 of FIG. 5. The only other probable path is shown on the third line--namely, "acebg"--and, in fact, represents the optimum path because the calculated probability number 30,121,736,008 is larger than any other calculated number. Thus, the Chinese characters "acebg" are selected as representing those characters defined by the Chinese symbols "tai2 wan1 you3 tai2 feng1" and, in fact, recite the sentence in Chinese: "Taiwan has typhoons." In more mathematical terms, I have discovered that the optimal path through the matrix represented by the Chinese characters can be derived by reference to the following formula where "c" refers to a Chinese character and p(c.sub.i .vertline.c.sub.i-1) refers to the probability of Chinese character i, given the previous adjacent occurrence of Chinese character i-1: ##EQU1## Conceptually one enumerates all of the possible paths though the lattice and then picks the best path according to the above formula. Of course, truly enumerating all possible paths is computationally expensive. However, as indicated in the above formula and previous discussion, the score for the entire path is computed by multiplying out the probabilities (frequencies) for adjacent pairs of characters, and one can therefore significantly reduce the paths that one has to consider by using dynamic programming (=Viterbi algorithm) techniques. In formal terms, note that for any syllable syl.sub.i, and for each c.sub.i,j (a Chinese character corresponding to syl.sub.i), one is considering the scores of all paths which end in c.sub.i,j. Note however, that we only need to keep the best path ending in c.sub.i,j. This is because when we move on to the next syllable syl.sub.i+l and consider all characters c.sub.i+l,k, we want to compute the scores for paths ending in the pair of characters c.sub.i,j c.sub.i+l,k (as well as the scores for paths ending in other characters corresponding to the pair syl.sub.i syl.sub. i+l); we do this computation for each path ending in c.sub.i,j, by multiplying the score for that path by the frequency of occurrence of the pair of characters c.sub.i,j c.sub.i+l,k, freq(c.sub.i,j c.sub.i+l,k). However, it is clear even before we perform these calculations that we really only need to consider the best scoring path ending in c.sub.i,j : since the same multiplier freq(c.sub.i,j c.sub.i+l,k) is used to compute the frequency of each path ending in the pair c.sub.i,j c.sub.i+l,k, the best path ending in the pair c.sub.i,j c.sub.i+l,k will simply be the best path ending in c.sub.i,j concatenated with the character c.sub.i+l,k. We therefore discard all but the best scoring path leading up to c.sub.i,j. Thus, rather than keeping around cc.sub.1 .multidot.cc.sub.2 . . . cc.sub.i-l paths (where cc.sub.m is the number of characters possibly corresponding to syl.sub.m) ending in c.sub.i,j, we only keep one. To illustrate, let us return to FIG. 5, which, as we have noted, represents a subset of the possible paths through the lattice given in FIG. 3. Suppose that we are considering syllable "wan1", in the third column of FIG. 5. Note that there are four distinct possible paths illustrated in FIG. 5 which end in possible transliterations of "wan1", namely: "iac", the initial subpath of the final paths numbered 1-8; "iad", the initial subpath of the final paths numbered 9-16; "ibc", the initial subpath of the final paths numbered 17-24; "ibd", the initial subpath of the final paths numbered 25-32. Following the above formal description, we can eliminate "ibc" because its score 0.times.0=0 obviously does not compete with the score of the other path ending in "c", namely "iac", whose score at this point is 322.times.1513=487,186. (Note that in practice the value 0 is not actually used, but rather an arbitrarily chosen very small valued constant; this is for technical reasons which do not affect the discussion here.) Naturally, there is no way that longer paths "ibce", "ibcf" can compete with "iace" or "iacf", respectively, either; this is because at the point we compute the scores for those paths, we are multiplying the scores for the pairs "ce" and "cf" with the scores for the paths "ibc" and "iac" and since we already know that "ibc" is an inferior candidate to "iac", it follows that "ibce" and "ibcf" are inferior candidates, respectively, to "iace" and "iacf". The subpath "ibc" can therefore be eliminated from further consideration. Both of the paths ending in "d", "iad" and "ibd" have 0 scores, and in principle we could eliminate both of these, but in practice one of them--this case "iad"--is kept around; it will be eliminated in later steps. At this point, then, we have eliminated two paths--"ibc" and "ibd"--and retained two--"iac" and "iad". Moving on to the syllable "wan1" in the fourth column of FIG. 5, we now have to consider the following possible paths: "iace", the initial subpath of the final paths numbered 1-4; "iacf", the initial subpath of the final paths numbered 5-8; "iade", the initial subpath of the final paths numbered 9-12; "iadf", the initial subpath of the final paths numbered 13-16. Of those paths ending in "e", namely "iace" and "iade", the former has a value of 322.times.1513.times.26=12,666,836 and the latter a value of 322.times.0.times.0=0; "iade" is therefore eliminated from further consideration. Of those paths ending in "f", namely "iacf" and "iadf", both have a value of 0 (322.times.1513.times.0 and 322.times.0.times.0 respectively); as above, one of these--in this case "iacf"--is kept around in practice. At this point we have kept only two paths "iace" and "iacf", and have eliminated six other possibilities; "iade" and "iadf" were eliminated on this step; and "ibce", "ibcf", "ibde", "ibdf" were eliminated on the previous step because of the elimination of "ibc" and "ibd" on that step. The frequency statistics of FIG. 4 used in the illustrative embodiment were derived from a corpus of 2.6 million characters of Chinese newspaper text. Estimates of the probabilities of the appearance of 2 adjacent Chinese characters can be derived by dividing the frequency of occurrence of a sequence by the size of the corpus. However, since all probability estimates thus derived represent the frequency divided by the corpus size, in practice, the corpus size is omitted from the calculation since it does not affect the maximization (thus the use of frequency rather than estimated probability in the above illustrative description of my method). To evaluate the effectiveness of my method, seven short samples of text of differing length were chosen, representative of various writing styles ranging from very classical to colloquial: i) Ad [classical] ii) Report [classical] iii) Newspaper social commentary taken from the training corpus [semi-classical] iv) Essay [more colloquial] v) Narrative [more colloquial] vi) Short story [colloquial] vii) Exposition [colloquial]. The performance of my method is given in the table in terms of percentage correct by character (hit rate) for each of the styles; the hit rate from my method is specified in the third column and should be compared with the hit rate achieved by merely picking the most common character given the pronounciation, which is given in the second column:
______________________________________
Style Lexical Probabilities Only
My Method
______________________________________
i [class.] 76% 93%
ii [class.] 73% 90%
iii [semi-class.]
76% 98%
iv [more coll.]
69% 73%
v {more coll.]
72% 86%
vi [coll.] 71% 89%
vii [coll.] 71% 92%
Total 73% 90%
______________________________________
A trend evident in these data is that there is some dependence upon style: my current training corpus is heavily classical in style since it is mostly derived from newspapers. As a consequence, texts (i-iii) which are classical in style are better rendered than the more colloquial texts, with the exception of (vii). I expect that this style dependence would become less marked if the training corpus were expanded to include other styles. Other modifications and applications of my invention will be apparent to those skilled in the art and are also within its spirit and scope.
|
Same subclass Same class Consider this |
||||||||||
