Compression method, method for compressing entry word index data for a dictionary, and machine translation system6502064Abstract A n-gram statistical analysis is employed to acquire frequently appearing character strings of n characters or more, and individual character strings having n characters or more are replaced by character translation codes of 1 byte each. The correlation between the original character strings having n characters and the character translation codes is registered in a character translation code table. Assume that a character string of three characters, i.e., a character string of three bytes, "sta," is registered as 1-byte code "e5" and that a character string of four characters, i.e., a character string of four bytes, "tion," is registered as 1-byte code "f1." Then, the word "station," which consists of a character string of seven characters, i.e., seven bytes, is represented by the 2-byte code "e5 f1," so that this contributes to a compression of five bytes. Claims What is claimed is: Description DETAILED DESCRIPTION OF THE INVENTION
TABLE 2
Entry Chara. Frequency Compress Entry Chara. Frequency Compress Entry
Chara. Frequency Compress
No. strg. count value No. strg. count value No.
strg. count value
#001 er 10945 5021 #071 mi 2382 1605 #141
cr 1315 1054
#002 in 9664 4960 #072 th 2350 1551 #142
gr 1306 1050
#003 ti 8081 4147 #073 ia 2346 1545 #143
ke 1302 1048
#004 on 7439 3987 #074 ur 2287 3072 #144
og 1300 1045
#005 at 6865 3772 #075 nc 2282 1524 #145
ry 1286 2088
#006 es 6742 3645 #076 as 2271 1523 #146
bo 1284 2062
#007 re 6717 3562 #077 sh 2250 1503 #147
vi 1272 1017
#008 en 6116 3363 #078 ent 2235 3002 #148
sp 1266 1012
#009 te 6537 3343 #079 ter 2223 1500 #149
tu 1254 2008
#010 an 6458 3326 #080 pr 2143 1498 #150
ag 1250 1003
#011 ne 6424 3323 #081 ec 2132 1490 #151
ph 1221 1002
#012 le 6379 3256 #082 ha 2127 1489 #152
ene 1184 1982
#013 al 6255 9315 #083 om 2088 1465 #153
ity 1181 988
#014 ly 5579 3059 #884 ho 2030 1463 #154
um 1179 976
#015 st 5536 6002 #085 ul 2016 1449 #155
ally 1178 971
#016 ss 5440 2934 #086 iv 1971 1435 #156
oc 1156 1932
#017 ra 5384 2797 #087 hi 1942 1427 #157
lcal 1156 956
#018 is 5123 2792 #088 ble 1928 4209 #158
do 1129 956
#019 ar 5021 2775 #089 ge 1898 1402 #159
ep 1127 952
#020 li 4969 2749 #090 no 1830 1402 #160
iz 1126 1898
#021 ic 4963 2703 #091 pa 1803 1393 #161
da 1124 948
#022 nt 4917 2684 #092 ns 1796 1388 #162
pi 1117 938
#023 or 4884 2538 #093 ty 1794 1378 #163
tt 1117 938
#024 rl 4807 2518 #094 po 1778 1374 #164
tor 1112 931
#025 ess 4321 2504 #095 abl 1172 1364 #165
gi 1095 2784
#026 it 4119 2502 #096 mo 1761 1354 #166
cu 1092 1846
#027 de 4046 2426 #097 all 1734 1353 #167
ant 1091 918
#028 io 4015 2382 #098 ad 1727 2700 #168
per 1090 918
#029 co 4006 2346 #099 ate 1714 1337 #169
ru 1080 918
#030 ed 3987 2345 #100 ct 1705 1335 #170
rd 1078 1792
#031 ng 3967 2287 #101 ica 1682 1334 #171
ver 1074 1790
#032 ro 3922 2274 #102 ous 1666 1330 #172
sm 1054 1788
#033 ca 3734 2271 #103 rt 1654 2644 #173
tiv 1052 1774
#034 il 3646 4446 #104 em 1653 1315 #174
wa 1050 883
#035 el 3562 2222 #105 ci 1648 1306 #175
ga 1045 1756
#036 la 3540 2191 #106 atio 1645 1302 #176
ali 1044 875
#037 ou 3442 2132 #107 ot 1622 1300 #177
lit 1026 874
#038 ve 3435 2127 #108 atior 1622 1286 #178
ex 1017 870
#039 ta 3343 2089 #109 ine 1617 1284 #179
rs 1112 867
#040 un 3326 2088 #110 am 1610 1272 #180
sti 1004 866
#041 nes 3270 2063 #111 ut 1551 1266 #181
lu 1003 1726
#042 se 3256 2044 #112 im 1845 1254 #182
ow 1002 861
#043 ness 3105 2030 #113 ist 1536 1250 #183
rat 991 1718
#044 ma 3087 2016 #114 os 1524 1233 #184
sl 988 853
#045 li 3072 1945 #115 ck 1523 1221 #185
fo 976 852
#046 us 3040 1942 #116 ig 1500 2432 #186
gl 971 1700
#047 ing 3001 1931 #117 ive 1494 2362 #187
tra 966 2538
#048 di 2958 1912 #118 id 1490 1179 #188
qu 956 1690
#049 ion 2957 1909 #119 bi 1489 3534 #189
men 952 1666
#050 bl 2943 1898 #120 ap 1465 1162 #190
res 949 827
#051 nd 2940 1879 #121 oo 1463 1156 #191
br 938 1622
#052 ea 2934 1830 #122 aa 1435 3468 #192
od 938 1616
#053 ab 2925 1803 #123 cal 1433 1153 #193
ra 931 1612
#054 to 2869 1796 #124 lly 1432 1129 #194
tive 928 1608
#055 me 2831 1778 #125 pi 1427 1127 #195
and 923 1604
#056 tr 2825 1776 #126 able 1403 1126 #196
fe 918 799
#057 na 2823 1774 #127 op 1402 1124 #197
ip 918 799
#058 si 2792 1761 #128 sc 1402 1123 #198
rr 918 799
#059 he 2782 1757 #129 su 1393 1119 #199
man 896 1592
#060 nl 2749 1727 #130 ee 1388 1117 #200
dis 895 1586
#060 atl 2719 3428 #131 ai 1378 1117
#062 io 2703 1677 #132 so 1364 2224 total
382234
#063 pe 2588 3332 #133 ba 1354 1095
#064 ch 2538 1654 #134 ir 1353 1092
#065 tio 2538 1653 #135 tlc 1350 2182
#066 et 2518 1648 #136 mp 1337 2180
#067 ce 2510 1622 #137 ie 1335 1080
#068 ol 2604 6488 #138 fi 1334 1078
#069 ac 2502 3234 #139 be 1330 2148
#070 tion 2398 1610 #140 con 1322 2148
Then, a compression contribution value when each character string having n characters, i.e., n bytes, in the statistics data is replaced with a one byte code is calculated (step S120). Sequentially, the entries in the statistic table are sorted in the descending order of their compression contribution values (step S122). As is described above, a compression contribution value is acquired by multiplying the frequency count and a difference of bytes (n-1). At step S124, overlaps in the statistics are removed. An overlap in the statistics is, for example, where for a long character string "ABCD," the frequency counts for shorter character strings "ABC," I.sup.? BCD," "AB," "IBC" and "CD," which are included in "ABCD," are obtained by overlapping the frequency count of the string "ABCD." Since the longer character string has a greater compression contribution value, the long character string should remain in the statistic table. Therefore, the frequency count for the character string "ABCD" must be subtracted from the frequency counts in the individual entries for the short character strings "ABC," "BCD," "AB," "BC" and "CD." For example, when the frequency counts for "ation" and "tion," in the statistic table generated immediately after the first IF loop is terminated, are 1622 and 2398, the frequency count 1622, which is a double count for "ation," is subtracted from the frequency count 2398 of "tion," and the value 776 (=2398-1622) is the true frequency count for the character string "tion." After the overlaps of the statistics are removed at step S124, the entries in the statistic table are sorted again in accordance with the descending order of the compression contribution values (step S126). Table 3 shows a statistic table obtained by sorting the entries in accordance with their compression contribution values. This is the result obtained by processing the entry word index data in a system dictionary included in the "King of Translation."
TABLE 3
Entry Chara. Frequency Compress Entry Chara. Frequency Compress Entry
Chara. Frequency Compress
No. strg. count value No. strg. count value No.
strg. Count value
#001 ness 3105 9315 #071 ul 2016 2016 #141
nc 1449 1449
#002 ation 1622 6488 #012 stl 1004 2008 #142
sa 1435 1435
#003 ing 3001 6002 #073 rat 991 1982 #143
pi 1427 1427
#004 ar 5021 5021 #074 is 1945 1945 #144
sta 706 1412
#005 re 4960 4960 #075 hi 1942 1942 #145
ect 705 1410
#006 ter 2223 4446 #076 tra 966 1932 #146
nal 704 1408
#007 able 1403 4209 #077 ic 1931 1931 #147
tri 703 1406
#008 ly 4147 4147 #078 it 1912 1912 #148
op 1402 1402
#009 ed 3987 3987 #079 re 1909 1909 #149
sc 1402 1402
#010 or 3772 3772 #080 ge 1898 1898 #150
su 1393 1393
#011 ie 3645 3645 #081 res 949 1898 #151
ari 695 1390
#012 el 3562 3562 #082 me 1879 1879 #152
ee 1388 1388
#013 ally 1178 3534 #083 and 923 1846 #153
ai 1378 1378
#014 ical 1156 3468 #084 no 1830 1830 #154
us 1374 1374
#015 ate 1714 3428 #085 pa 1803 1803 #155
com 685 1310
#016 er 3363 3363 #086 ns 1796 1796 #156
so 1364 1364
#017 ta 3343 3343 #087 man 896 1792 #157
ba 1354 1354
#018 ous 1666 3332 #088 dis 895 1790 #158
ism 677 1354
#019 un 3326 3326 #089 nde 894 1788 #159
ir 1353 1353
#020 ri 3323 3323 #090 po 1778 1778 #160
ize 676 1352
#021 se 3256 3256 #091 ou 1776 1776 #161
ato 674 1348
#022 ine 1617 3234 #092 in 1774 1774 #162
ina 672 1344
#023 ist 1536 3072 #093 lin 887 1774 #163
mp 1337 1337
#024 ro 3059 3059 #094 mo 1761 1761 #164
ie 1335 1335
#025 ent 1501 3002 #095 to 1757 1757 #165
fi 1334 1334
#026 ea 2934 2934 #096 der 878 1756 #166
be 1330 1330
#027 an 2797 2797 #097 ad 1727 1727 #167
cr 1315 1315
#026 si 2792 2792 #098 pro 863 1726 #168
gr 1306 1306
#029 tive 928 2784 #099 nte 859 1718 #169
ke 1302 1302
#030 ia 2775 2775 #100 ili 850 1700 #170
og 1300 1300
#031 nl 2749 2749 #101 tin 845 1690 #171
ry 1286 1286
#032 lo 2703 2703 #102 ce 1677 1677 #172
bo 1284 1284
#033 tic 1350 2700 #103 nce 833 1666 #173
vi 1272 1272
#034 co 2684 2684 #104 rt 1654 1654 #174
sp 1266 1266
#035 con 1322 2644 #105 em 1653 1653 #175
tu 1254 1254
#036 ch 2538 2538 #106 cl 1648 1648 #176
ag 1250 1250
#037 enes 846 2538 #107 ot 1622 1622 #177
he 1233 1233
#036 et 2618 2538 #108 str 811 1622 #178
ph 1221 1221
#039 ol 2504 2504 #109 pre 808 1616 #179
um 1179 1179
#040 ac 2502 2502 #110 les 806 1612 #180
li 1162 1162
#041 ess 1216 2432 #111 am 1610 1610 #181
oc 1156 1156
#042 on 2426 2426 #112 her 804 iWs #182
ab 1153 1153
#043 mi 2382 2382 #113 th 1605 1605 #183
do 1129 1129
#044 ity 1181 2362 #114 int 802 1604 #184
ep 1127 1127
#045 ia 2346 2346 #115 est 796 1592 #185
iz 1126 1126
#446 en 2345 2345 #116 ete 793 1586 #186
da 1124 1124
#047 tion 776 2328 #117 ut 1551 1551 #187
nd 1123 1123
#048 ur 2287 2287 #118 im 1545 1545 #188
ss 1119 1119
#049 de 2274 2274 #119 era 767 1534 #189
pl 1117 1117
#050 as 2271 2271 #120 ist 765 1530 #190
tt 1117 1117
#051 tor 1112 2224 #121 nti 765 1530 #191
gi 1095 1095
#052 il 2222 2222 #122 os 1524 1524 #192
cu 1092 1092
#053 ment 734 2202 #123 ck 1523 1523 #193
ru 1080 1080
#054 ma 2191 2191 #124 cti 753 1506 #194
rd 1078 1078
#055 ant 1091 2182 #125 sh 1503 1503 #195
sm 1054 1054
#056 per 1090 2180 #126 ran 751 1502 #196
wa 1050 1050
#057 ati 1074 2148 #127 ig 1500 1500 #197
tr 1048 1048
#058 ver 1074 2148 #128 pe 1498 1498 #198
ga 1045 1045
#059 lity 711 2133 #129 ish 747 1494 #199
ex 1017 1017
#060 ec 2132 2132 #130 eri 746 1492 #200
rs 1012 1012
#061 ha 2127 2127 #131 id 1490 1490
#062 call 708 2124 #132 the 745 1490
total 490898
#063 na 2089 2089 #133 bi 1489 1489
#064 om 2068 2088 #134 rin 738 1476
#065 ali 1044 2088 #135 ona 734 1468
#066 di 2063 2063 #136 ap 1465 1465
#067 lit 1026 2052 #137 oo 1463 1463
#068 al 2044 2044 #138 ted 730 1460
#069 ines 677 2031 #139 gra 727 1454
#070 ho 2030 2030 #140 min 727 1454
As is apparent from Table 3, the compression contribution value 9315 for character string "ness," which is the first entry, is the highest, and compression contribution value 6488 for character string "ation" is the second highest. FIG. 4 is a detailed flowchart showing the n-gram statistical analysis process routine at step S108. The individual steps will now be described. At step S200, the length of a character string for an entry word that is being processed (i.e., it is substituted into the variable REST) is substituted into variable LEN. At step S202 a check is performed to determine whether N is equal to or smaller than LEN. When N exceeds LEN, the n-gram statistical analysis process is not required (there are no n-gram statistics for character strings having (N-1) characters, for example). Program control exits at branch "No" at decision block S202 and the process routine is terminated. When N is equal to or smaller than LEN, program control branches to "Yes" and the following process is performed. At step S204 a value of 1 is substituted into variable J. The variable J is a variable for designating a character string segment that consists of the Jth and the following characters of the character string REST. In the IF loop constituted by a conditional sentence (S206), "Is J equal to or smaller than LEN-N+1, " the n-gram statistic analysis is conducted for character strings having N characters, which are included in the character string segment consisting of the Jth and following characters of the character string REST in the process. When J exceeds LEN-N+1, no character strings having N or more characters remain in the segment consisting of Jth and following characters of the character string REST. Program control exits at branch "No" at decision block S206 and the process routine is terminated. When J is equal to or smaller than LEN-N+1, the following step is performed. At step S208 a check is performed to determine whether a character string having N characters beginning with the Jth character of the character string REST already exists in the statistic table. If REST="ABCD," and J=2 and N=2, a check is performed to determine whether character string BC, which consists of two characters beginning with the second character of the character string ABCD, is present in the statistic table. If a corresponding entry exists in the statistic table, the frequency count of the entry is incremented by one (step S210). When no such entry is found in the statistic table, a new entry is added and its frequency count is set to 1 (step S212). The n-gram statistical analysis has been conducted for a character string having N characters beginning at the Jth character of the character string REST, and J is incremented by one (step S214). Program control then returns to step S206 to repeat the n-gram statistical analysis for a character string having N characters beginning at the (J+1)th character. Generation of character translation code table: After the statistic table has been prepared in which entries are arranged in accordance with the descending order of their compression contribution values in the process routine in FIGS. 3 and 4, a character code translation table for replacing a character string with code is generated. To embody the present invention, a new table for translating characters into code may be designed. In this embodiment, an ASCII (American Standard Code for Information Interchange) code table is employed that is well known and widely used as a table for assigning alphanumerical characters to code, and unused columns in this code table are newly assigned for character strings having high compression contribution values. The advantage of the employment of the ASCII code table is that conventional code can be used unchanged for regular alphanumerical characters, such as a, b, c, . . . and 0, 1, 2, . . . . The ASCII code table conforms to the specifications established by ANSI (American National Standards Institute). FIG. 5 is a flowchart showing a process routine for generating a new character translation code table in accordance with compression contribution values obtained in the n-gram statistical analysis process. The individual steps will now be explained. First, character strings in a count equivalent to the number of unused areas in the character translation code table are extracted from the high ranks of the statistic table (step S300). When the ASCII code table is used as the character translation code table, there are 185 unused columns (a case where English capital letters are not used), and only 185 highly ranked entries in the statistic table need be acquired. Then, the obtained character strings are sorted in alphabetical order (step S302). Each of the sorted character strings is assigned a position, beginning with the first unused area of the character translation code table (step S304). The character strings are assigned positions in alphabetical order because this facilitates the performance of a following dictionary search process, which will be described later. Table 4 shows a character translation code table prepared during the process routine in FIG. 5. The table is based on the ASCII code table, and conventional codes are assigned unchanged for regular alphanumerical characters, such as a, b, c . . . and 0, 1, 2, . . . (in Table 4, the conventional column assignments in the ASCII code table displayed are enclosed in frames). A character string "ab" having a high compression contribution value is assigned for unused column 0x01 in the ASCII code table, and character string "ot" is assigned for unused column 0xc9 in the table.
TABLE 4
00 01 02 03 04 05 06 07 08 09 0a
0b 0c 0d 0e 0f
0 .times. 00 (null) ab able ac ad ag ai al ali ally
am an and ant ap ar
0 .times. 10 ari as ate atiation ato ba be bi bo
call ce ch ci ck co
0 .times. 20 (space) ! " # $ % & ' ( )
* + , - . /
0 .times. 30 0 1 2 3 4 5 6 7 8 9
: ; < = > ?
0 .times. 40 @ com con cr cti de der di dis do
ea ec ect ed ee el
0 .times. 50 em en enes ent ep er era eri ess est
et .about. .Yen. ]
0 .times. 60 ' a b c d e f g h i
j k l m n o
0 .times. 70 p q r s t u v w x y
z [ .parallel. ] .
0 .times. 80 fi ge gr gra ha he her hi ho ia
ic ical id ie ig il
0 .times. 90 ili im in ina ine ines ing int ir is
ish ism ist it ity iz
0 .times. a0 ize ke la lat le les li lin lit lity
lo ly ma man me ment
0 .times. b0 mi min mo mp na nal nc nce nde ness
ni no ns nte nti oc
0 .times. c0 og ol om on ona oo op or os ot
ou ous pa pe per ph
0 .times. d0 pi po pre pro ra ran rat re res ri
rin ro rt ry sa sc
0 .times. e0 se sh si so sp sta ste sti str su
ta ted ter th the tic
0 .times. f0 tin tion tive to tor tra tri tu ul um
un ur us ut ver vi
It should be noted that Table 4 shows the results obtained by processing the previously described entry word index data in a system dictionary included in the "King of Translation." Generation of dictionary entry word index: When a new character translation code table is prepared, this is employed to generate a new dictionary entry word index. In Table 4 representing character translation code, a character string having n characters, i.e., n bytes (n is an integer greater than 1) is replaced with a one-byte code (previously described). Among the entry words, since a character string of n bytes that has a high compression contribution value is replaced by one byte code in accordance with the character translation code table, a compression effect of (n-1) bytes can be provided by preparing a new entry word index. FIG. 6 is a flowchart showing the process routine for generating (compressing) a dictionary entry word index. The individual steps will now be explained. The first IF loop, constituted by the conditional sentence (step S400), "Is there an unprocessed entry word?," initiates the compression process for the entire entry word index. In the first IF loop, the first remaining entry word is extracted from the original entry word index, and is substituted into a variable STR (step S402). An initial value of 1 is substituted into variables I and J, and the length of the character string STR is substituted into variable LEN (step S404). The variable I is used to designate the Ith character of the original character string STR, and the variable J is used to designate the Jth character of a new character string NEW. In the second IF loop, constituted by the conditional sentence (step S406), "Is I equal to or smaller than LEN," the compression process for the character string STR is performed. In the compression process, the individual character string segments of the character string STR are replaced by codes from the character translation code table. First, a character string segment that consists of the Ith and the following characters of the character string STR is compared with each character string in the character translation code table shown as Table 4 (step S408). This comparison is performed in the reverse direction, starting at the last entry in the character translation code table. Since character strings are assigned in the character translation code table in alphabetic order (see Table 4), the table is searched in the reverse direction so that character string having more characters can be examined first in the comparison process. When, for example, a character string segment "lity" exists at the Ith and the following characters of the character string STR, "lit" and "lity" are selected as candidate matching character strings in the Table 4, and the character string segment is first compared with "lity," which appears later in the alphabet order (i.e., has more characters). If a character string that matches the character string segment that consists of the Ith and the following characters of the character string STR is found in the character translation code table, the matching code is substituted into the Jth character of a new character string NEW (step S410), and the variable I is incremented by a number equivalent to the number of characters of this matching character string (step S412). For example, when the segment that consists of the Ith and the following characters of the string STR includes a 4-byte character string "ness," the character string segment is replaced by a one-byte character "b9," in accordance with the character translation code table. At this time, the variable I is incremented by four. If in the code table there is no character string that matches the character string segment that consists of the Ith and the following characters of the character string STR, the Ith character of the original character string STR is substituted into the Jth character string of the new character string NEW (step S414), and the variable I is incremented by one (step S416). After a matching code, or one character of the original string, is substituted into the Jth character of the new character string NEW, and the variable J is incremented by one (step S418), program control returns to step S406 to repeat the above described IF loop processing. When the variable I exceeds the character string length LEN, it means that the process for translating the original character string STR into the new character string NEW has been terminated. Program control exits the second IF loop at branch "No" at decision block S406. At step S420, the original entry STR in the entry word index is replaced by the translated code NEW, and program control thereafter returns to step S400. At step S400, a check is performed to determine whether unprocessed entry words remain in the entry word index. If so, the above described process is repeated for the remaining entry words. If there is no unprocessed entry word, it is assumed that the entire entry word index has been processed. Program control thereafter exits the routine at branch "No" at decision block S400, and the processing routine is terminated. Table 5 shows one part of the new entry word index that is generated while being compared with the original entry words. In the new entry word index, one byte numbers are listed using the hexadecimal numbering system.
TABLE 5
New entry Original entry New entry Original entry
word index word index word index word index
61 2d 19 6d 62 a - bo m b 01 47 63 15 72 ab di c ato r
61 2d 63 0e cd 6c 82 a - c ap pe l la 01 49 ae 6e ab do me n
61 2d 45 75 78 a - de u x 01 49 b1 07 ab do min al
61 2d 66 c3 64 a - f on d 01 49 b1 09 ab do min ally
61 2d 66 c7 74 69 c7 69 a - f or t i or i 01 64 75 63 74 ab d u c t
61 2d a2 2d 63 0f 74 65 a - la - c ar t e 01 64 15 44 c3 ab d u cti on
61 2d a2 2d 6b 96 a - la - k ing 01 64 75 63 f4 ab d u c tor
61 2d a2 2d b2 45 a - a - mo de 01 4a 6d ab ea m
61 2d a4 76 4f a - le v el 01 4b 4d 10 0b ab ec ed ari an
61 2d 6e f9 17 72 a - n um be r 01 Ad ab ed
61 2d d1 e6 d9 c7 69 a - po ste ri or i 01 4f 65 ab el e
61 2d 70 d9 c7 69 a - p ri or i 01 55 d5 lb ab er ran ce
61 2d 74 50 d1 a - t em po 01 55 d5 63 79 ab er ran c y
61 2e 63 2e a . c . 01 55 d5 74 ab er ran t
61 2e 6d 2e a . m . 01 55 d5 74 ab ab er ran t ly
61 2e 77 2e 6f 2e 6c 2e a . w . o . l . 01 55 d5 69 c3 ab er rat i on
61 2f 63 a / c 01 55 69 c4 6c ab er rat i ona l
61 0f 64 76 0f 6b a ar d v ar k 01 5a ab et
01 2d 92 9d 69 6f ab - in it i o 01 5a af ab et ment
01 2e ab . 01 5a ec ab et ter
01 03 69 ab ac i 01 5a f4 ab et tor
01 03 6b ab ac k 01 65 79 0b 1b ab e y an ce
01 03 fc ab ac us 01 65 79 0d ab e y ant
01 61 66 74 ab a f t 01 88 72 ab ho r
01 07 c3 65 ab al on e 01 88 72 d7 b7 ab ho r re nce
01 0c c3 ab and on 01 88 72 d7 6e 74 ab ho r re n t
01 0c c3 4d ab and on ed 01 88 72 d7 6e 74 ab ab ho r re n t
ly
01 0c c3 55 ab and on er 01 88 72 d7 72 ab ho r re r
01 0c c3 af ab and on ment 01 8c 0b 1b ab id an ce
01 11 65 ab as e 01 8c 65 ab id e
01 11 50 53 ab as em ent 01 8c 55 ab id er
01 11 68 ab as h 01 6c 96 ab id ing
01 11 85 64 ab as he d 01 90 74 8d 73 ab ili t ie s
01 11 68 af ab as h ment 01 90 74 79 ab ili t y
01 61 ea 62 a4 ab a ta b le 01 6a 4c ab j ect
01 12 ab ate 01 6a 69 c3 ab j ect i on
01 12 af ab ate ment 01 6a 4c ab ab j ect ly
01 13 73 ab ati s 01 6a 4c b9 ab j ect ness
01 61 74 74 99 ab a t t is 01 6a fb 14 ab j ur ation
01 61 14 f3 18 ab a t to ir 01 6a fb 65 ab j ur e
01 16 63 79 ab ba c y 01 6a fb 55 ab j ur er
01 16 74 89 6c ab ba t ia l 01 a3 65 ab lat e
01 17 73 73 ab be s s 01 a3 69 c3 ab lat i on
01 17 79 ab be y 01 a3 69 75 65 ab lat i v e
01 19 74 ab bo t 01 a2 fd ab la ut
01 62 72 2e ab b r 01 a2 7a 65 ab la z e
01 62 d7 76 2e ab b re v. 02 able
01 62 d7 ff 12 ab b re vi ate
01 62 d7 ff 14 ab b re vi ation
01 62 d7 ff 15 72 ab b re vi ato r
01 47 63 02 ab di c able
01 47 63 12 ab di c ate
01 47 63 14 ab di c ation
As is apparent from Table 5, entry word "a-bomb" is compressed into five bytes of code, "61 2d 19 6d 62," whereas entry word "abandon," which has seven characters, i.e., seven bytes, is replaced by the three-byte code "01 0c c3," so that a compression effect of four bytes is obtained. Entry word "able," which has four characters, i.e., four bytes, is replaced by the one-byte code "02," so that a compression effect of three bytes is obtained. As an experimental result, when the compression method in this embodiment was applied for the entry word index of the system dictionary for the "King of Translation," the original entry word index of 625 Kbytes was compressed to a length of 388 Kbytes. When the amount of entry word index data is small, they can be made resident in the main memory 14 of the computer system 100, without being exchanged (swapped out). Since the access speed for memory-resident data is high, an increase in the dictionary search speed is obtained. Especially for a machine translation system that prepares some dictionaries, the compression of data to reduce its Isize is very effective when it is desired to make the entry word index data memory resident. C-2. Second embodiment A second embodiment for compressing entry word index data will now be described while referring to FIGS. 7 to 11. The second embodiment differs from the first embodiment in that a difference between entry word character strings that are adjacent to each other is acquired prior to the performance of a compression process based on the n-gram statistical analysis. Differential process for entry word index data: FIG. 7 is a flowchart showing the process routine for calculating a difference between adjacent entry word character strings. The individual steps will now be explained. An empty character is entered as an immediately preceding character string PREV (step S500). An IF loop, constituted by the conditional sentence (step S502), "Is there an unprocessed entry word?," initiates the differential process for all the entry words. In the IF loop, initially, the first remaining entry word character string is extracted from the original entry word index, and is substituted into the current character string CURR (step S504). Then, a check is performed to determine how many characters starting at the beginning of the preceding character string PREV match those in the current character string CURR (step S506). The count of the matching characters and the difference in the character strings PREV and CURR are output (step S508). The current character string CURR is substituted into the immediately preceding character string PREV (step S510), and program control returns to step S502. At step S502, the acquisition of an entry word remaining in the original entry word index is attempted. If there is an unprocessed entry word, program control branches to "Yes" at the decision block S502, and the above described differential process is repeated. If the differential process has been completed for all the entry words, program control exits the routine at branch "No" at decision block S502, and the process routine is terminated. Table 6 shows one part of the entry word index for which the differential process is performed while being compared with the original entry word index. The character string "a-bomb" is defined as the head of the entry word index. The original entry word index is that of a system dictionary in the "King of Translation."
TABLE 6
Matching Matching
chara. Differential Original chara. Differential Original
count character string character string count character string
character string
00 a-bomb a-bomb 06 te abdicate
02 cappella a-cappella 07 ion
abdication
02 deux a-deux 07 or abdicator
02 fond a-fond 03 omen abdomen
04 rtiori a-fortiori 05 inal abdominal
02 la-carte a-la-carte 09 ly
abdominally
05 king a-la-king 03 uct abduct
05 mode a-la-mode 06 ion abduction
03 evel a-level 06 or abductor
02 number a-number 00 abeam abeam
02 posteriori a-posteriori 03 cedarian
abecedarian
03 riori a-priori 03 d abed
02 tempo a-tempo 03 le abele
00 a.c. a.c. 03 rrance aberrance
02 m. a.m. 08 y aberrancy
02 w.o. l. a.w.o.l. 07 t aberrant
01 /c a/c 08 ly
aberrantly
01 ardvark aardvark 06 tion
aberration
01 b-initio ab-initio 0a al
aberrational
02 ab. 03 t abet
00 abaci abaci 04 ment abetment
04 k aback 04 ter abetter
04 us abacus 05 or abettor
03 ft abaft 03 yance abeyance
03 lone abalone 06 t abeyant
03 ndon abandon 00 abhor abhor
07 ed abandoned 05 rence
abhorrence
08 r abandoner 08 t abhorrent
07 ment abandonment 09 ly
abhorrently
03 se abase 07 r abhorrer
05 ment abasement 00 abidance abidance
04 h abash 04 e abide
05 ed abashed 05 r abider
05 ment abasement 04 ing abiding
00 abatable abatable 03 lities abilities
04 e abate 06 y ability
05 ment abatement 02 ject abject
04 is abatis 06 ion abjection
04 tis abattis 06 ly abjectly
05 oir abattoir 06 ness
abjectness
00 abbacy abbacy 03 uration
abjuration
04 tial abbatial 05 e abjure
03 ess abbess 06 r abjurer
04 y abbey 00 ablate ablate
03 ot abbot 05 ion ablation
03 r. abbr. 06 ve ablative
04 ev. abbrev. 04 ut ablaut
06 iate abbreviate 04 ze ablaze
09 ion abbreviation 03 e able
09 or abbreviator
00 abdicable abdicable
As is shown in Table 6, since the first to the sixth characters of entry word "abjection," which is just below "abject," match the character string "abject," the matching character count 06 and a differential character string "ion" form a new entry word. Further, since the first to the sixth characters of the next entry word "abjectly," which is in the two line below from "abject," match the immediately preceding entry word "abjection," the matching character count 06 and a differential character string "ly" forms a new entry word. An entry word index for which the differential process is performed is hereinafter called a "tentative entry word index." n-gram statistical analysis process: FIG. 8 is a flowchart showing the processing for calculating a compression contribution value for each entry word. The processing is positioned as a pre-process for the tentative entry word index data compression processing. The compression contribution value represents the compression effect imposed on the tentative entry word index data when a character string of n characters (i.e., n bytes) or more is replaced by a character string of less than n characters (one byte in this case). It would be easily understood that the compression contribution value is large when a character string that frequently appears in a tentative entry word index, or a character string that consists of many characters (many bytes), is replaced by a single character (i.e., a one-byte code). The frequencies at which character strings having n characters (n=2, 3, . . . ) appear in the tentative entry word index are calculated using the so-called n-gram statistical analysis. The compression contribution value when a character string having n bytes is substituted with one byte code is acquired by multiplying the count at which the character string appears in the entry word index, and a byte difference (n-1). The individual steps in the flowchart in FIG. 8 will now be described in detail. A first IF loop constituted by a conditional sentence (step S600), "Is there an unprocessed entry word?" is initiated to examine the n-gram statistics for the entire tentative entry word index. In the first IF loop, initially, a differential character string of the first remaining entry word is read from the tentative entry word index, and is substituted into variable REST (step S602). Then, value 2 is substituted into N (step S604), and the processing is initiated for the 2-gram statistical analysis. In the second IF loop constituted by a conditional sentence (step S606), "Is N equal to or smaller than the length of the REST character string?," the n-gram statistical analysis process is performed for the character string REST (step S608, described in detail later). When the n-gram statistical analysis process for N=2, i.e., the 2-gram statistical analysis process has been completed, N is incremented by one (step S610), and the same IF loop processing (i.e., the (N+1)-gram statistical analysis process) is repeated. When N exceeds the length of the character string REST, it is assumed that the n-gram statistical analysis process for the character string REST has been terminated, and at branch "No" of the decision block S606, program control exits the second IF loop and returns to step S600. At step S600, the acquisition of a differential character string for the next entry word in the tentative entry word index is attempted. If the n-gram statistical analysis process is terminated for all the entry words in the tentative entry word index, program control exists the first IF loop at branch "No" of decision block S600. The termination of the first IF loop represents the completion of the collection of the n-gram statistic data. At this time, a tentative n-gram statistic table is generated. Then, a compression contribution value when each character string having n bytes in the statistics data is replaced with a one byte code is calculated (step S620). Sequentially, the entries in the statistic table are sorted in the descending order of their compression contribution values (step S622). As is described above, a compression contribution value is acquired by multiplying the frequency count and a difference of bytes (n-1). At step S624, overlaps in the statistics are removed. An overlap in the statistics is, for example, where for a long character string "ABCD," the frequency counts for shorter character strings "ABC," "BCD," "AB," "BC" and "CD," which are included in "ABCD," are obtained by overlapping the frequency count of the string "ABCD." Since the longer character string has a greater compression contribution value, the long character string should remain in the statistic table. Therefore, the frequency count for the character string "ABCD" must be subtracted from the frequency counts in the individual entries for the short character strings "ABC," "BCD," "BC," "BC" and "CD." After the overlaps of the statistics are removed at step S624, the entries in the statistic table are sorted again in accordance with the descending order of the compression contribution values (step S626). Table 7 shows a statistic table obtained by sorting the entries in accordance with their compression contribution values. This is the result obtained by processing the entry word index data in a system dictionary included in the "King of Translation."
TABLE 7
Entry Chara. Frequency Compress Entry Chara. Frequency Compress Entry
Chara. Frequency Compress
No. strg. count value No. strg. count value No.
strg. Count value
#001 ness 3054 9162 #071 one 189 378 #141
ble 137 274
#002 ly 3745 3745 #072 ot 372 372 #142
out 137 274
#003 ing 1625 3250 #073 sa 372 372 #143
tive 91 273
#004 tion 1032 3096 #074 st 372 372 #144
re 271 271
#005 able 762 2286 #075 ver 183 366 #145
om 266 266
#006 ion 942 1884 #076 ical 121 363 #146
ber 133 266
#007 atio 617 1851 #077 ingly 90 360 #147
ir 264 264
#008 le 1702 1702 #078 izat 119 357 #148
bo 263 263
#009 ability 265 1590 #079 per 178 356 #149
is 263 263
#010 ed 1559 1559 #080 di 352 352 #150
ci 262 262
#011 ment 518 1554 #081 ta 352 352 #151
ded 131 262
#012 lity 505 1515 #082 her 176 352 #152
gra 131 262
#013 er 1489 1489 #083 oo 350 350 #153
han 131 262
#014 ilit 496 1488 #084 ite 174 348 #154
tr 261 261
#015 bill 491 1473 #085 ure 174 348 #155
land 87 261
#016 or 1268 1268 #086 am 346 346 #156
do 260 260
#017 abil 365 1095 #087 ia 346 346 #157
table 65 260
#018 al 1005 1005 #088 ibil 115 345 #158
ger 129 258
#019 loss 332 996 #089 ster 115 345 #159
ow 255 255
#020 ally 316 948 #090 ard 170 340 #160
ai 254 254
#021 ity 458 916 #091 da 339 339 #161
iti 126 252
#022 ate 457 914 #092 co 338 338 #162
ral 126 252
#023 ism 456 912 #093 sion 112 336 #163
ke 250 250
#024 man 448 896 #094 ha 335 335 #164
wa 249 249
#025 en 838 838 #095 ina 165 330 #165
head 83 249
#026 ter 418 836 #096 pe 329 329 #166
sc 248 248
#027 zati 278 834 #097 os 326 326 #167
ker 124 248
#028 zation 159 795 #098 nce 163 326 #168
ran 123 246
#029 ous 377 754 #099 hi 325 325 #169
va 244 244
#030 ily 367 734 #100 nt 325 325 #170
nal 122 244
#031 ve 711 711 #101 late 108 324 #171
meter 61 244
#032 ines 237 711 #102 um 322 322 #172
so 242 242
#033 tic 345 690 #103 ide 161 322 #173
che 121 242
#034 ist 341 682 #104 olog 107 321 #174
ress 80 240
#035 ence 221 663 #105 rate 107 321 #175
ain 119 238
#036 iness 160 840 #106 ze 318 318 #176
ary 119 238
#037 ie 639 639 #107 rt 317 317 #177
ere 119 238
#038 ance 206 618 #108 eil 158 316 #178
ock 119 238
#039 ro 611 611 #109 as 314 314 #179
po 236 236
#040 se 602 602 #110 line 104 312 #180
eri 117 234
#041 ian 293 586 #111 pa 310 310 #181
tan 117 234
#042 ted 276 552 #112 the 155 310 #182
ouse 78 234
#043 ibility 90 540 #113 catl 103 309 #183
rian 78 234
#044 ch 522 522 #114 red 154 308 #184
ut 233 233
#045 ial 259 518 #115 ld 306 306 #185
cti 116 232
#046 age 258 516 #116 logi 102 306 #186
tal 116 232
#047 ish 258 516 #117 la 302 302 #187
ification 29 232
#048 th 456 456 #118 ace 150 300 #188
kin 114 228
#049 ization 76 456 #119 ari 150 300 #189
card 76 228
#050 el 455 455 #120 tte 150 300 #190
house 57 228
#051 ol 443 443 #121 ric 148 296 #191
ast 113 226
#052 ge 430 430 #122 ect 146 292 #192
est 113 226
#053 ic 430 430 #123 era 146 292 #193
nic 113 226
#054 ful 215 430 #124 icat 97 291 #194
rac 113 226
#055 ee 415 415 #125 ba 289 289 #195
wor 113 226
#056 ur 415 415 #126 to 287 287 #196
eter 75 225
#057 ent 207 414 #127 ear 143 286 #197
no 223 223
#058 ship 138 414 #128 nde 143 286 #198
ff 222 222
#059 et 408 408 #129 na 284 284 #199
mo 222 222
#060 and 202 404 #130 ton 141 282 #200
up 221 221
#061 ho 401 401 #131 ngly 94 282
#062 ry 400 400 #132 tabl 94 282 total
108221
#063 der 200 400 #133 ill 140 280
#064 ious 132 396 #134 ome 140 280
#065 ight 131 393 #135 und 140 280
#066 ma 392 392 #136 ga 279 279
#067 sh 384 384 #137 op 278 278
#068 ni 383 383 #138 min 139 278
#069 si 380 380 #139 ingl 92 276
#070 ant 189 378 #140 ade 137 274
As is apparent from Table 7, the compression contribution value 9162 for character string "ness," which is the first entry, is the highest, and compression contribution value 3745 for character string "ly" is the second highest. FIG. 9 is a detailed flowchart showing the n-gram statistical analysis process routine at step S608. The individual steps will now be described. At step S700, the length of a differential character string for an entry word being processed (i.e., it is substituted into the variable REST) is substituted into variable LEN. At step S702 a check is performed to determine whether N is equal to or smaller than LEN. When N exceeds LEN, the n-gram statistical analysis process is not required (there are no n-gram statistics for character strings having (N-1) characters, for example). Program control exits at branch "No" at decision block S702 and the process routine is terminated. When N is equal to or smaller than LEN, program control branches to "Yes" and the following process is performed. At step S704 a value of 1 is substituted into variable J. The variable J is a variable for designating a character string segment that consists of the Jth and the following characters of the character string REST. In the IF loop constituted by a conditional sentence (S706), "Is J equal to or smaller than LEN-N+1,1" the n-gram statistic analysis is conducted for character strings having N characters, which are included in the character string segment consisting of the Jth and following characters of the character string REST in the process. When J exceeds LEN-N+1, no character strings having N or more characters remain in the segment consisting of Jth and following characters of the character string REST. Program control exits at branch "No" at decision block S706 and the process routine is terminated. When J is equal to or smaller than LEN-N+1, the following step is performed. At step S708 a check is performed to determine whether a character string having N characters beginning with the Jth character of the character string REST already exists in the statistic table. If REST="ABCD," and J=2 and N=2, a check is performed to determine whether character string BC, which consists of two characters beginning with the second character of the character string ABCD, is present in the statistic table. If a corresponding entry exists in the statistic table, the frequency count of the entry is incremented by one (step S710). When no such entry is found in the statistic table, a new entry is added and its frequency count is set to 1 (step S712). The n-gram statistical analysis has been conducted for a character string having N characters beginning at the Jth character of the character string REST, and J is incremented by one (step S714). Program control then returns to step S706 to repeat the n-gram statistical analysis for a character string having N bytes beginning at the (J+1)th character. Generation of character translation code table: After the statistic table has been prepared in which entries are arranged in accordance with the descending order of their compression contribution values in the process routine in FIGS. 8 and 9, a character code translation table for replacing a character string with code is generated. To embody the present invention, a new table for translating characters into code may be designed. In this embodiment, an ASCII (American Standard Code for Information Interchange) code table is employed that is well known and widely used as a table for assigning alphanumerical characters to code, and unused columns in this code table are newly assigned for character strings having high compression contribution values. The advantage of the employment of the ASCII code table is that conventional code can be used unchanged for regular alphanumerical characters, such as a, b, c, . . . and 0, 1, 2, . . . . The ASCII code table conforms to the specifications established by ANSI (American National Standards Institute). FIG. 10 is a flowchart showing a process routine for generating a new character translation code table in accordance with compression contribution values obtained in the n-gram statistical analysis process. The individual steps will now be explained. | ||||||
