Weighting method for use in information extraction and abstracting, based on the frequency of occurrence of keywords and similarity calculations6240378Abstract An information abstracting method and apparatus for extracting and displaying keywords as an information abstract. Given a large number of character string data sets divided into prescribed units, the extracted keywords are significant and effective in describing a topic common to the plurality of units. The information abstracting apparatus comprises an input section for accepting an input of character string data divided into prescribed units, with each individual character represented by a character code, and an output section for displaying the result of information abstracting. Keywords contained in each of the prescribed units are extracted by a keyword extracting section from the character string input data from the input section. A score is calculated for each keyword by a score calculating section, so that a higher score is given to a keyword extracted from a larger number of units. On the basis of the calculated scores, keywords are selected by an abstracting section and are outputted as an information abstract by the output section. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
Frequency of
Keyword occurrence
{character pullout} 2
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 2
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
The keyword count results thus obtained are stored (pushed) onto a FILO (first-in-last-out) stack. In step a5, one keyword/frequency pair is retrieved from the stack stored in step a4. The retrieved data is deleted from the stack. Therefore, by repeatedly executing step a5, there will be left no data that can be retrieved from the stack. When all data have been removed, the process proceeds to step a10. In step a6, the value in the "keyword" item of the pair retrieved in step a5 is retrieved. Steps a7 to a9 are performed in the score calculating section 4. In step a7, using the keyword retrieved in step a6 it is judged whether any keyword identical to it exists in a keyword frequency list. The keyword frequency list is constructed as shown in Table 2 below and, initially, the values are set empty. Therefore, when this step is called for the first time, the process automatically proceeds to step a9. Table 2 looks as if it consists only of the keywords extracted from unit 1 in the data example shown in FIG. 5, but actually, the keyword frequency table eventually contains the keywords extracted from all the units. In the description of this embodiment, only a part of the keyword frequency table is shown as an example, since the whole table is too large to be shown here.
TABLE 2
Intra-unit Inter-unit
Keyword frequency share count
{character pullout} 2 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 2 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
The judging operation in step a7 is performed by sequentially comparing the keyword retrieved in step a6 against the keywords in the "keyword" item in the keyword frequency table. However, meaningless keywords may have been extracted as keywords in the keyword extracting operation, as described in connection with step a3; therefore, if one of the keywords compared one against each other is a subset of the other and the character count of the subset portion is longer than half the keyword length, such as {character pullout} versus {character pullout}, then these keywords are regarded as a match, and the keyword is treated as consisting of the common portion {character pullout}. Step a8 is performed when the keyword retrieved in step a6 matches a keyword already entered in the keyword frequency list. This means that identical keywords exist in more than one unit. In this case, 1 is added to the value of "Inter-unit share count" in the keyword table format shown in Table 2. As for the value of "Intra-unit frequency", the value in the "frequency of occurrence" item of the pair retrieved in step a5 is entered in the item of "Intra-unit frequency". For example, in the keyword frequency table such as Table 2, consider the case where the pair retrieved in step a5 is {character pullout}{character pullout}, 3). In this case, by performing step a8 the keyword frequency table as shown below (Table 3) is obtained.
TABLE 3
Intra-unit Inter-unit
Keyword frequency share count
{character pullout} 2 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 2 0
{character pullout} 1 0
{character pullout} 4 1
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
As shown in Table 3, by performing step a8 changes are made to the keyword, intra-unit frequency, and inter-unit share count in the row of the keyword {character pullout}{character pullout}. In Table 2, the keyword was {character pullout}{character pullout}, but during the comparison performed step a7, since this keyword contains {character pullout}{character pullout} within it and the common portion is longer than half the keyword length, these keywords are treated as the same keyword and combined as one common keyword {character pullout}{character pullout}. Step a9 is performed when the keyword retrieved in step a6 does not match any keywords stored in the keyword frequency table. In this case, the keyword retrieved in step a6 is added to the keyword frequency table, and the intra-unit frequency is set to the same value as the frequency of occurrence of the pair retrieved in step a5. The inter-unit share count is set to 0. In step a10, the abstracting section 5 retrieves keywords from the keyword frequency list in decreasing order of inter-unit share count and outputs as many keywords as can be displayed to the output section 2 for display. For example, when a maximum of three keywords can be displayed, the keywords, {character pullout}{character pullout}, {character pullout}, and {character pullout}, are selected in sequence from the keyword frequency table such as shown in Table 4.
TABLE 4
Intra-unit Inter-unit
Keyword frequency share count
{character pullout} 3 1
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 2 1
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 1 0
{character pullout} 2 1
{character pullout} 1 0
{character pullout} 2 0
{character pullout} 1 0
{character pullout} 3 3
{character pullout} 3 2
{character pullout} 3 2
{character pullout} 1 0
{character pullout} 2 1
{character pullout} 1 0
Between keywords having the same inter-unit share count, the keyword having the higher intra-unit frequency is selected first. Another method of keyword selection is to calculate priority on the basis of the weighted sum of the inter-unit share count and the intra-unit frequency and display the keywords in order of priority. In this method, constants, S and T, denoting weights, are determined first, and then, (intra-unit frequency.times.S+inter-unit share count.times.T) is calculated for each keyword, and keywords are selected in decreasing order of this value. Still another method is possible which considers the keyword length in making selection. In this method, priority is determined based on a weighted sum calculated by taking keyword character count into account. For example, in addition to S and T, a constant U is determined first, and then, (intra-unit frequency.times.S+inter-unit share count.times.T+keyword character count.times.U) is calculated, and keywords are selected in decreasing order of this value. The selected keywords are displayed as an information abstract on the output section 2. For example, when the keywords {character pullout}{character pullout}, {character pullout}, and {character pullout} have been selected, an output result such as the one shown in FIG. 6 is obtained. Table 4 does not consist only of the keywords extracted from unit 1, but consists of the keywords extracted from all the units. In an alternative method of extracting keywords as an information abstract, the inter-unit share count is compared only among the keywords selected from one unit, and keywords are extracted in decreasing order of inter-unit share count. In this manner, keywords contained in one unit, but also used in other units, can be extracted. Next, the operation will be described when the embodiment of the first invention is applied to another example. In the example hereinafter described, data input through the input section 1 is described in English. For example, consider the data example shown in FIG. 37. The units in this dada example consist of data representing abstracts of articles carried in Los Angeles Times, Chicago Tribune, USA TODAY, Atlanta Constitution, etc. In the following description, the differences of processing of English text data from that of Japanese text data will be explained in detail. The flowchart of FIG. 3, the same one used for explaining the processing of Japanese text data, will be used. Steps a1 and a2 are the same as described in the processing of Japanese text data. In step a3, in the case of Japanese data a keyword was extracted based on the difference in character type, but in the case of the English language, since words are already separated by a space, a keyword is extracted with the space as a delimiter. Therefore, the processing is simple compared to the case of the Japanese language. In the present embodiment, only words consisting of more than one character in the case of a word beginning with an upper-case letter, and words consisting of more than five characters in the case of a word beginning with a lower-case letter, are extracted as keywords. This is to avoid extracting articles and prepositions as keywords. Further, when there occurs a succession of words each beginning with an upper-case letter, such words are concatenated and regarded as one word. For example, from "Rosa Mota of Portugal", "Rosa Mota" and "Portugal" are extracted as keywords. A succession of words, each beginning with an upper-case letter, is thus regarded as one word because the names of persons and the like should be treated as inseparable words. As an example, from the first paragraph of unit 1 in FIG. 37, {"U. S. Olympic", "champion", "Joan Benoit Samuelson", "featured", "NYC Marathon"} are extracted. In the input data example of FIG. 37, it is assumed that the paragraphs in each unit are separated from each other by a blank line. In step a4, the same processing as for Japanese text data is performed. From the keywords extracted from unit 1 in FIG. 37, for example, Table 5 shown below is obtained as the keyword count results.
TABLE 5
Frequency of
Keyword occurrence
Olympic 2
champion 1
Joan Benoit Samuelson 1
featured 1
marathon 2
Rosa Mota 1
Portugal 1
In Table 5, the frequency of occurrence of "Olympic" is 2. This is because "U.S. Olympic" and "Olympic", which were separately selected, have been combined into one keyword since "Olympic" is contained within "U.S. Olympic" and the contained portion is longer than half the keyword length, as in the processing of Japanese text data. For "marathon" also, the frequency of occurrence is 2 because "NYC Marathon" and "Marathon" have been combined into one keyword. It will also be noted that when judging whether a keyword is a subset of another keyword or when judging a match between keywords, no distinction is made between upper-case and lower-case letters. In step a5 to step a9, the same processing as for Japanese text data is performed. For example, Table 6 shown below is obtained as the keyword frequency table.
TABLE 6
Intra-unit Inter-unit
Keyword frequency share count
Olympic 6 3
champion 1 0
Joan Benoit Samuelson 1 0
featured 1 0
marathon 7 3
Rosa Mota 3 2
Portugal 3 2
In step a10, the same processing as for Japanese text data is performed. For example, a prescribed number of keywords are retrieved in decreasing order of inter-unit share count from the keyword frequency table such as shown in Table 6. When three keywords are to be retrieved from Table 6, for example, "Olympic", "marathon", and "Rosa Mota" are retrieved. Here, "Portugal" and "Rosa Mota" are equal both in inter-unit share count and in intra-unit frequency, but "Rosa Mota" consists of 9 characters including the space while "Portugal" consists of 8 characters. In such a case, the one that has the larger character count is selected. The selected keywords are output to the output section 2, as shown in FIG. 38. In Japanese, the title is {character pullout}, but in English, "Trend" is the title. Next, one embodiment of the second invention will be described with reference to drawings. As one embodiment of the second invention, a teletext broadcast receiving apparatus is shown in which an information abstracting method is employed. FIG. 7 is a diagram showing the system configuration for the embodiment of the second invention. In FIG. 7, reference numeral 21 is a teletext broadcast receiving section for receiving a teletext broadcast; 22 is a channel storing section for storing channels of prescribed programs; 23 is a teletext keyword extracting section for extracting keywords contained in received programs on a program-by-program basis; 24 is a teletext score calculating section for calculating a score for each keyword in such a manner that a higher score is given to a keyword appearing in a larger number of programs; 25 is a teletext abstracting section for selecting keywords on the basis of the calculated scores and for extracting them as an information abstract; and 26 is a display section for displaying the result of abstracting. A hardware configuration for implementing the above-configured system is shown in FIG. 8. The configuration shown in FIG. 8 is fundamentally the same as that of a general-purpose computer system, and consists of the constituent parts of the hardware configuration shown in FIG. 2 as one embodiment of the first invention, plus the constituent parts of the system configuration shown in FIG. 7 as one embodiment of the second invention. Accordingly, the same constituent parts between them are designated by the same reference numerals, and their detailed explanation will not be repeated here. The operation of the thus configured teletext broadcast receiving apparatus will be described below with reference to the flowchart of FIG. 9. Since this invention is not directed to the teletext broadcast receiving system itself, the teletext broadcast receiving portion of the system is simply treated as one constituent part, that is, the teletext broadcast receiving section 21, and will not be described in detail here. In steps b1 and b2, one of the channels stored in the channel storing section 22 is retrieved, and in accordance with the retrieved channel, a broadcast is received by the teletext broadcast receiving section 21. In step b1, since one channel is retrieved from the channel storing section 22 each time the step is called, there will be left no more channels that can be retrieved when the step has been called for the (N+1)th time where N is the number of channels stored. In that case, the process proceeds to step b5, as shown in the flowchart. The channels stored in the channel storing section 22 are as shown in Table 7 below.
TABLE 7
Channels
4 ch 02#
6 ch 101#
: :
When the data shown in Table 7 are stored in the channel storing section 22, data 4ch and 02# are retrieved when step b1 is called for the first time. Such a channel designation is used because in the current teletext broadcast system a program is designated by a channel number plus a number followed by #. In step b2, the contents of the received teletext program are stored in the main storage 12. The contents may be stored in the external storage 13 instead. By storing in the channel storing section 22 a plurality of channels on which programs in a specific genre are broadcast, such as news programs or sports programs, for example, the recent trend in that specific genre can be extracted as the result of abstracting according to the present embodiment. In step b3, the teletext keyword extracting section 23 extracts a keyword from the contents of the teletext program stored in step b2. The keyword extraction is performed in the same manner as done in step a3 in the embodiment of the first invention. One program received on teletext broadcasting corresponds to one unit in the embodiment of the first invention. In step b4, the number of keywords extracted in step b3 is counted and stored (pushed) onto the stack in the same manner as in step a4 in the embodiment of the first invention. The keyword count results are stored in the stack in the same format as that of the count result table (Table 1) shown in the embodiment of the first invention. The processing in steps b5 to b9 is the same as steps a5 to a9 performed in the embodiment of the first invention. However, in the embodiment of the first invention, the format of the keyword frequency table is defined as shown in Table 2, but in the embodiment of the second invention, since teletext programs are used instead of units, the item names are changed as shown in Table 8 below.
TABLE 8
Intra-program Inter-program
Keyword frequency share count
The processing itself is the same as that performed in the embodiment of the first invention; that is, the same processing performed on the intra-unit frequency is performed on the intra-program frequency, and the same processing performed on the inter-unit share count is performed on the inter-program share count. Further, the processing performed by the score calculating section 4 in the embodiment of the first invention is performed by the teletext score calculating section 24 in the embodiment of the second invention. In step b10, the same processing as that in step a10 in the embodiment of the first invention is performed in the teletext abstracting section 25. The data example shown in FIG. 5 used in the description of the embodiment of the first invention is taken from teletext broadcasts, and the output result is the same as the example shown in FIG. 6 in the embodiment of the first invention. The output result is displayed on the display section 26. The processing performed when the embodiment of the second invention is applied to English text data is the same as the processing explained in connection with the embodiment of the first invention, and therefore, the explanation will not be repeated here. Data described in English can also be handled in the second invention, as in the first invention. Next, one embodiment of the third invention will be described below with reference to drawings. As one embodiment of the third invention, an information abstracting method and an information abstracting apparatus are shown wherein associations between keywords are considered in abstracting information. FIG. 10 is a diagram showing the system configuration for the embodiment of the third invention. Some of the constituent parts in FIG. 10 are the same as those shown in the configurational example of the embodiment of the first embodiment; therefore, the same constituent parts are designated by the same reference numerals and detailed explanation of such parts will not be given here. In FIG. 10, reference numeral 31 is a keyword associating section for establishing associations between the keywords extracted by the keyword extracting section 3 from the same paragraph within the same unit; 32 is an association score calculating section for calculating scores for the keywords extracted by the keyword extracting section 3 and the keyword associations established by the keyword associating section 31, in such a manner that higher scores are given to keywords and keyword associations appearing in a larger number of units; and 33 is an association abstracting section for selecting keywords and keyword associations on the basis of the scores calculated by the association score calculating section 32, and for displaying them as an information abstract on the output section 2. The hardware configuration for implementing the above-configured system is the same as the hardware configuration of the one embodiment of the first invention shown in FIG. 2. Therefore detailed explanation will not be given here. The operation of the thus configured information abstracting apparatus and information abstracting method wherein keyword associations are considered will be described with reference to the flowchart shown in FIG. 11. The processing in steps c1 to c3 is the same as steps a1 to a3 performed in the embodiment of the first invention. In step c4, keywords extracted from the same paragraph within the same unit are associated with each other in the keyword associating section 31. In keyword association, two keywords are paired together. Within each keyword pair consisting of two associated keywords, the order of the keywords is insignificant. That is, (Keyword 1, Keyword 2) and (Keyword 2, Keyword 1) are both regarded as the same associated keyword pair. Of separately extracted two keywords, if one is a subset of the other, these are regarded as the same keyword in the same manner as described in connection with step a4, and after combining such keywords into one keyword, the keyword association is performed. In unit 1 of the data example shown in FIG. 5, for example, the paragraphs are separated from each other by a blank line. From the first paragraph in unit 1 of this data example are extracted the keywords {character pullout}, {character pullout}{character pullout}, {character pullout}, {character pullout}{character pullout}, and {character pullout}{character pullout}, for which keyword associations {character pullout}{character pullout}, {character pullout}{character pullout}, {character pullout} {character pullout}{character pullout}{character pullout}, {character pullout} {character pullout}{character pullout}, {character pullout}{character pullout}, {character pullout} {character pullout}{character pullout}{character pullout}, {character pullout} {character pullout}{character pullout}, {character pullout} {character pullout}{character pullout}{character pullout}, {character pullout} {character pullout}{character pullout}, and {character pullout}{character pullout}{character pullout} {character pullout}{character pullout} are generated. From the illustrated data example, {character pullout} and {character pullout} are extracted as keywords, but these are combined into one keyword {character pullout} in the same manner as previously described in connection with a4. In step c5, the keywords and the keyword associations extracted in steps c3 and c4 are arranged in the form of a count table, and are stored (pushed) onto a FILO (first-in-last-out) stack in the same manner as in step a4 described in the embodiment of the first invention. When the same keyword or keyword association occurs more than one time, such keywords or keyword associations are added together and the total count is stored in the item of the frequency of occurrence. The format of the count table is as shown in Table 9 below.
TABLE 9
Frequency of
Keyword or Keyword association occurrence
In the item of "Keyword or Keyword association" in Table 9, a specific keyword or keyword association is entered as an item value. In the case of a keyword association, since two keywords are paired, the item value is entered in the format of (Keyword 1, Keyword 2). The count table obtained, for example, from the keywords and keyword associations extracted from the first paragraph in unit 1 of the data example shown in FIG. 5, will be as shown in Table 10 below.
TABLE 10
Frequency of
Keyword or Keyword association occurrence
{character pullout} 2
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 2
{character pullout} 2
{character pullout} 2
{character pullout} 2
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
{character pullout} 1
In Table 10, the frequency of occurrence of each keyword association is set to the same value as that of the keyword having the higher frequency of occurrence of the two associated keywords. In step c6, data consisting of a keyword or keyword association paired with its frequency of occurrence, stored in stack in step c5, is retrieved. In step c7, the value of the item "Keyword or Keyword association" in the data retrieved in step c6 is retrieved. If the value represents a keyword, the process proceeds to step c8; on the other hand, if the value represents a keyword association, the process proceeds to step c11. Whether the retrieved item value represents a keyword or a keyword association is judged in the following manner: if the item value is a combination of two keywords such as (keyword 1, keyword 2), the value is judged as representing a keyword association, and if it is simply one keyword such as Keyword 1, the value is judged as representing a keyword. In step c8 to c10, the same processing as in steps a7 to a9 in the one embodiment of the first invention is performed by the association score calculating section 32. The format of the keyword frequency table, however, is as shown in Table 11 below.
TABLE 11
Keyword or
Keyword Intra-unit Inter-field
association frequency share count
The keyword frequency table shown in Table 11 is similar in format to the table shown in Table 2, the only difference being that the item name "Keyword" is changed to the item name "Keyword or Keyword association". This is because not only a keyword but a keyword association is also stored as the item value. The processing in steps c8 to c10 is performed only when the value of the item "Keyword or Keyword association" in the data retrieved in step c6 represents a keyword; therefore, the processing is the same as that in steps a7 to a9, except that the item name is different. Step c11 is performed when the data retrieved in step c6 concerns a keyword association, in which case it is judged whether the keyword frequency table contains an association identical to the retrieved association. This is accomplished by sequentially comparing the keyword association retrieved in step c6 against the values in the item "Keyword or Keyword association" in the keyword frequency table. This judgement may be performed by using another efficient method. In judging the keyword association, association designations (Keyword 1, Keyword 2) and (Keyword 2, Keyword 1) are regarded as the same association, as previously described in connection with step c4. Further, when one keyword is a subset of another keyword, such keywords are combined into one keyword by using the same criterion and method as previously described in connection with step a7. Step c12 is performed when the keyword association retrieved in step c6 matches a keyword association already entered in the keyword frequency list. This means that identical keyword associations exist in different paragraphs. In this case, 1 is added to the value of "Inter-unit share count" in the keyword frequency table. Since keywords are associated on a paragraph basis, not on a unit basis, the item names "Inter-unit share count" and "Intra-unit frequency" should actually be "Inter-paragraph share count" and "Intra-paragraph frequency", respectively, but to share the same keyword frequency table with keywords, the item names for keywords are also used for keyword associations. As for the value of "Intra-unit frequency", the value of the frequency of occurrence of the pair retrieved in step c6 is added to the value of the item "Intra-unit frequency". For example, consider the case where the pair retrieved in step c6 is {character pullout}{character pullout} {character pullout}, 2). In this case, the value of the intra-unit frequency is increased by 2 and the value of the inter-unit share count is increased by 1 for the data whose item value of the attribute "Keyword or Keyword association" in the keyword frequency table is {character pullout}{character pullout} {character pullout}. Step c13 is performed when the keyword association retrieved in step c6 does not match any keyword associations stored in the keyword frequency table. In this case, the associated keyword pair retrieved in step c6 is added to the keyword frequency table, and the intra-unit frequency is set to the same value as the frequency of occurrence of the pair retrieved in step c6. The inter-unit share count is set to 0. The above steps c11 to c13 are performed in the association score calculating section 32. In step c14, the association abstracting section 33 retrieves a predetermined number of keywords from the keyword frequency table in decreasing order of inter-unit share count. Next, for each keyword retrieved, a predetermined number of associations are retrieved from keyword associations containing that keyword, in decreasing order of inter-unit share count. Finally, the results thus retrieved are displayed on the output section 2. For keywords or keyword associations having the same inter-unit share count, priority is given in decreasing order of intra-unit frequency. If the intra-unit frequency also is the same, priority is given in retrieval in decreasing order of character counts in the case of keywords and in decreasing order of character counts of associated keywords in the case of keyword associations. Consider, for example, the case where two keywords and two associations for each of the keywords are retrieved from the keyword frequency table shown in Table 12 on page 133. In this case, if two keywords are selected in decreasing order of inter-unit share count, {character pullout}{character pullout} and {character pullout}{character pullout} are retrieved. The keyword {character pullout}{character pullout} has the same inter-unit count as the keyword {character pullout}{character pullout} or {character pullout}, but {character pullout}{character pullout} is selected based on its intra-unit frequency. When keyword associations containing the retrieved keywords are selected based on their inter-unit share counts, then {character pullout}{character pullout} {character pullout}{character pullout}{character pullout} {character pullout}, {character pullout}{character pullout} {character pullout} and {character pullout} {character pullout}{character pullout} are retrieved, the associated keywords being {character pullout}, {character pullout}, {character pullout}{character pullout}, and {character pullout}. A display example of the thus retrieved keywords and keyword associations is shown in FIG. 12. In FIG. 12, a combination of three keywords represents one topic. Another method of selecting keywords and keyword associations may be by retrieving predetermined numbers of keywords and keyword associations in decreasing order of the weighted sum of the intra-unit frequency and inter-unit share count. To obtain the weighted sum, constants S and T denoting weights are determined first, and then, (intra-unit frequency.times.S+inter-unit share count.times.T) is calculated for each keyword. As another method of extracting the result of abstracting in step c14, a method that can extract topics more accurately will be described below. First, a keyword that has the largest inter-unit share count is selected, and this selected keyword is denoted by A. Then, of the keywords associated with the keyword A, a keyword that provides the largest inter-unit share count when associated with the keyword A, is selected, and this selected keyword is denoted by B. Next, of keywords associated with both of the selected keywords A and B, a keyword that provides the largest combined inter-unit share count when associated with the respective keywords, is selected, and this selected keyword is denoted by C. Then, the keywords A, B, and C are combined to represent a first topic. To extract a second topic, first the keywords representing the first topic and the keyword associations containing these keywords are deleted from the keyword frequency table, and then the same processing as described above is performed. In making selection, if there are two or more keywords or keyword associations having the same inter-unit share count, their intra-unit frequencies, keyword lengths, etc. are compared, as described earlier. As an example, the processing performed when extracting two topics from Table 12 will be described below. When extracting the first topic, {character pullout}{character pullout} is selected as the keyword A, and {character pullout} which provides the largest inter-unit share count when associated with the keyword A, is selected as the keyword B. Further, {character pullout}, which is associated with both of these keywords, and which provides the largest combined inter-unit share count when associated with the respective keywords, is selected as the keyword C. When extracting the second topic, first the keywords {character pullout}{character pullout}, {character pullout}, and {character pullout}, used to extract the first topic, and the keyword associations containing these keywords, are deleted from the keyword frequency table, and then the keywords A, B, and C are selected in the same manner as described above. From Table 12, {character pullout}{character pullout}, {character pullout}{character pullout}, and {character pullout} are selected as the keywords representing the second topic. Next, the operation will be described when the embodiment of the third invention is applied to another example. In the example hereinafter described, data input through the input section 1 is described in English. For example, consider the data example shown in FIG. 37, as in the case of the embodiment of the first invention previously described. The following description deals with the processing of English text data, focusing on differences from the processing of Japanese text data. The flowchart of FIG. 11, the same one used for explaining the processing of Japanese text data, will be used. The processing in steps c1 and c2 is the same as described in the processing of Japanese text data. In step c3, the difference from the processing of Japanese text data is the same as that described in connection with step a3 when the embodiment of the first invention was applied to the English data example. The processing in steps c4 to c14 is the same as described in the processing of Japanese text data. An example of the count table obtained in step c5 is shown in Table 13.
TABLE 13
Frequency of
Keyword or Keyword association occurrence
Olympic 2
champion 1
Joan Benoit Samuelson 1
featured 1
marathon 2
Rosa Mota 1
Portugal 1
(Olympic, champion) 2
(Olympic, Joan Benoit Sauuelson) 2
(Olympic, featured) 2
(Olympic, marathon) 2
(Olympic, Rosa Mota) 2
(Olympic, Portugal) 2
(champion, Joan Benoit Samuelson) 1
(champion, featured) 1
(champion, marathon) 2
(Joan Benoit Samuelson, featured) 1
(Joan Benoit Samuelson, marathon) 2
(featured, marathon) 2
(marathon, Rosa Mota) 2
(marathon, Portugal) 2
(Rosa Mota, Portugal) 1
An example of the keyword frequency table is shown in Table 14. In step c14, first a keyword that has the largest inter-unit share count is selected from the keyword frequency table, and this selected keyword is denoted by A. Then, of the keywords associated with the keyword A, a keyword that provides the largest inter-unit share count when associated with the keyword A, is selected, and this selected keyword is denoted by B. Next, of keywords associated with both of the selected keywords A and B, a keyword that provides the largest combined inter-unit share count when associated with the respective keywords, is selected, and this selected keyword is denoted by C. Then, the combination of the keywords A, B, and C is output as data representing one topic. More than one topic represented by such a keyword combination may be extracted. For example, when two such topics are extracted from entries whose values are explicitly given in Table 14, two keyword combinations {marathon, Rosa Mota, Portugal} and {Olympic, Gelindo Bordin, Italy} are obtained. When extracting the second topic, first the keywords representing the first topic and the keyword associations containing these keywords are deleted from the keyword frequency table, and then the second topic is extracted in the same manner as described above. An example of the output result is shown in FIG. 39.
TABLE 14
Keyword or Intra-unit Inter-unit
Keyword association frequency share count
Olympic 6 5
champion 1 0
Joan Benoit Samuelson 1 0
: : :
(Olympic, champion) 1 0
(Olympic, Joan Benoit Sauuelson) 1 0
: : :
Rosa Mota 3 2
Portugal 3 2
marathon 7 6
: : :
(Rosa Mota, Portugal) 3 2
(Rosa Mota, marathon) 3 2
(Rosa Mota, Olympic) 2 1
(Portugal, marathon) 3 2
: : :
Gelindo Bordin 2 1
Italy 2 1
overtakes 1 0
: : :
(Gelindo Bordin, Italy) 2 1
(Gelindo Bordin, overtakes) 2 0
(Gelindo Bordin, Olympic) 6 1
(Gelindo Bordin, marathon) 7 1
(Italy, Olympic) 2 1
: : :
In the above embodiment, processing has been performed on the same paragraph within the same unit, but processing may be performed on a sentence-by-sentence basis, instead of a paragraph-by-paragraph basis. Further, in the above-described invention, both the intra-unit frequency and the inter-unit share count were calculated for each keyword, based on which the keywords to be displayed as an abstract were selected. Another method of selection is also possible that relies only on the inter-unit share count. Such a method is effective because there are cases in which the same keyword appears repeatedly within the same unit simply because of rephrasing or for reasons of expression; in such cases, the intra-unit frequency may become large because of reasons of expression rather than the importance as a keyword. Therefore, a method is also needed that performs processing by using only the inter-unit share count without calculating the intra-unit frequency, for simplicity of processing. When displaying keywords as an abstract of a document, often the general content of the document can be communicated by a combination of keywords; in such cases, the abstracting method can be simplified and the implementation in an actual apparatus can be facilitated by treating only the importance of association between keywords rather than treating the importance of each individual keyword. Such a method will be described below. First, in step c5, the processing on keywords is not performed, but the processing is performed only on keyword associations. Further, the processing on the frequency of occurrence is also not performed. More specifically, keywords are not stored in the stack, but only keyword associations are stored in the stack. In this case, since the processing on the frequency of occurrence is not performed, the item of "Frequency of occurrence" is removed from the count table format shown in Table 9. That is, the table consists only of the item "Keyword or Keyword association" where only a keyword association is entered as an item value. Identical keyword associations are combined into one association. An example of the count table is shown in Table 15.
TABLE 15
Keyword or Keyword association
{character pullout}
{character pullout}
{character pullout}
{character pullout}
{character pullout}
{character pullout}
{character pullout}
{character pullout}
{character pullout}
{character pullout}
Further, steps c8 to c10 are eliminated, and in step c7, no judging operation is performed and the branch to step c11 is always followed. In steps c12 and c13, since the data retrieved in step c6 does not contain the item "Frequency of occurrence", the processing on the intra-unit frequency, which uses the value of this item, is not performed. As a result of the above processing, a keyword frequency table such as shown in Table 16 is obtained which consists only of the keyword associations found in the keyword frequency table shown in Table 12 and does not have the item "Intra-unit frequency".
TABLE 16
Keyword or Inter-unit
Keyword association share count
{character pullout} 1
{character pullout} 0
{character pullout} 0
{character pullout} 2
{character pullout} 0
{character pullout} 3
{character pullout} 0
{character pullout} 2
{character pullout} 0
{character pullout} 0
{character pullout} 1
{character pullout} 0
{character pullout} 1
{character pullout} 0
{character pullout} 0
{character pullout} 0
{character pullout} 0
{character pullout} 0
Further, in step c14, a keyword association that has the largest inter-unit share count is retrieved from among the keyword associations in the keyword frequency table. For example, {character pullout} {character pullout} {character pullout} 3) is retrieved from Table 16. Next, of the keywords associated with both of the two keywords A and B constituting the retrieved keyword association, a keyword that provides the largest combined inter-unit share count when associated with A and B, is selected. When we consider the keyword frequency table shown in Table 16, for example, the keywords A and B correspond to {character pullout} {character pullout} and {character pullout}, and the keyword associated with both of them and having the largest combined inter-unit share count is {character pullout}. Accordingly, the keyword {character pullout} is selected. The keywords A and B and the keyword selected in the above manner are combined together to represent one topic. When obtaining another topic as an information abstract, the keyword associations containing the keywords already selected are deleted from the keyword frequency table, and from among the remaining keyword associations, three keywords are selected in the same manner as described above. More specifically, in the above example, first the keywords {character pullout} {character pullout}, {character pullout}, and {character pullout} are selected, and then the keyword associations containing these keywords are deleted; after that, ({character pullout} {character pullout}, {character pullout}, 2) is retrieved as the keyword association having the largest inter-unit share count. In this case, the keywords A and B are {character pullout} {character pullout} and {character pullout} {character pullout}, and the keyword associated with both of them and having the largest combined inter-unit share count is {character pullout}, which is therefore selected. As a result, {{character pullout} {character pullout}, {character pullout}, {character pullout}} and {{character pullout} {character pullout}, {character pullout} {character pullout}, {character pullout}} are retrieved as information abstracts, producing an output result as shown in FIG. 12. It will be recognized that the method that relies only on the inter-unit share count, as described above, can also be applied to English text data, with the earlier described differences in processing from Japanese text data. Next, one embodiment according to the fourth invention will be described with reference to drawings. As one embodiment of the fourth invention, a teletext broadcast receiving apparatus is shown which is equipped with an information abstracting function wherein associations between keywords are considered in abstracting information. FIG. 13 is a diagram showing the system configuration for the embodiment of the fourth invention. Some of the constituent parts in FIG. 13 are the same as those shown in the configurational example of the embodiment of the second invention; therefore, the same constituent parts are designated by the same reference numerals and detailed explanation of such parts will not be given here. In FIG. 13, reference numeral 41 is a teletext keyword associating section for establishing associations between the keywords extracted by the teletext keyword extracting section 23 from the same paragraph within the same program; 42 is a teletext association score calculating section for calculating scores for each keyword and each keyword association in such a manner that a higher score is given to a keyword or keyword association appearing in a larger number of programs; and 43 is a teletext association abstracting section for selecting keywords and keyword associations on the basis of the scores calculated by the teletext association score calculating section 42, and for displaying them on the output section 26. The hardware configuration for implementing the above-configured system is the same as the hardware configuration of the embodiment of the second invention shown in FIG. 8. Therefore detailed explanation will not be given here. The operation of the thus configured teletext broadcast receiving apparatus equipped with an information abstracting function wherein keyword associations are considered will be described with reference to the flowchart shown in FIG. 14. The processing in steps d1 to d3 is the same as steps b1 to b3 performed in the embodiment of the second invention. In steps d4 to d14, the same processing as steps c4 to c14 in the embodiment of the third invention is performed, except that step d4 is performed in the teletext keyword associating section 41, steps d8 to d13 in the teletext association score calculating section 42, and step d14 in the teletext association abstracting section 43. In this invention, as in the third invention, a method of selection is considered that relies only on the inter-unit share count. Since the processing in steps d4 to d14 is fundamentally the same as the processing performed in steps c4 to c14, this method can be implemented by making the same modifications as described in connection with the third invention, and therefore detailed explanation will not be given here. Further, a method can also be considered that performs processing only on keyword associations, as described in the embodiment of the third invention. Next, one embodiment according to the fifth invention will be described with reference to drawings. As one embodiment of the fifth invention, an information abstracting method and an information abstracting apparatus are shown. FIG. 16 is a diagram showing the system configuration for the embodiment of the fifth invention. Some of the constituent parts in FIG. 16 are the same as those shown in the configurational example of the embodiment of the first invention; therefore, the same constituent parts are designated by the same reference numerals and detailed explanation of such parts will not be given here. FIG. 16 differs from the system configuration of the embodiment of the first invention in that a similarity score calculating section 51 is provided which calculates similarity between keywords extracted by the keyword extracting section 3. The hardware configuration for implementing the above-configured system is the same as the hardware configuration of the embodiment of the first invention shown in FIG. 2. The operation of the thus configured information abstracting apparatus and information abstracting method will be described with reference to the flowchart shown in FIG. 17. The processing in steps e1 to e3 is the same as steps a1 to a3 performed in the embodiment of the first invention. Steps e4 to e9 are performed in the similarity score calculating section 51. In step e4, for the keywords extracted in step e3 the frequency of occurrence is calculated by taking similarity into account. In calculating the frequency of occurrence, keywords that exactly match are combined and the frequency of occurrence is set to the same value as the number of exactly matching keywords. For example, when four identical keywords {character pullout} {character pullout} are extracted, these keywords {character pullout} {character pullout} are combined into one keyword and its frequency of occurrence is set to 4. Next, for nonidentical keywords A and B, the number of characters that match in both character and character order between them is divided by the mean between the character counts of the keywords A and B, and the resulting value is used as the similarity. For example, between {character pullout} {character pullout} and {character pullout} {character pullout}, {character pullout} {character pullout} is common, so that five characters match in both character and character order between them. The mean between the character counts of the two keywords is 6 characters. As a result, the similarity between the two characters is calculated as 5/6=0.83 (calculated down to two decimal places). Further, denoting the similarity calculated between the two keywords A and B as S, (S.times.frequency of occurrence of B) is added to the frequency of occurrence of A, and (S.times.frequency of occurrence of A) is added to the frequency of occurrence of B. For example, suppose that, before the additions to the frequency of occurrence using the similarity, the frequency of occurrence of {character pullout} {character pullout} was 3 and that of {character pullout} {character pullout} was 5. Then, using the similarity 0.83, the occurrence of frequency of {character pullout} {character pullout} is calculated as 3+5.times.0.83=7.17, and the occurrence of frequency of {character pullout} {character pullout} is calculated as 5+3.times.0.83=7.49. Such keyword similarity is calculated for every combination of keywords extracted in step e3. To reduce the calculation time, one method that can be considered is by calculating similarity only for keyword combinations in which more than half the number of characters match between two keywords. Since the frequency of occurrence is calculated on the basis of similarity as described above, the keyword count results obtained include frequencies expressed by numbers other than integers, as shown in Table 17 on page 134. The keyword count results obtained are each stored (pushed), with an identifier identifying the unit, onto the FILO (first-in-last-out) stack. In the embodiment of the first invention, in step a4, keywords that are considered identical are combined into one keyword, and similarity between the keywords is not taken into account. Therefore, when updating the intra-unit frequency and inter-unit share count in the keyword frequency table in the subsequent steps a5 to a9, the situation cannot occur where a keyword extracted from one unit is considered identical to another keyword extracted from the same unit, even if the keyword is not stored in the stack with an identifier identifying the unit. As a result, the situation cannot occur where keywords extracted from the same unit are considered identical in step a7 and the inter-unit share count is updated in the subsequent step a8. On the other hand, when keyword similarity has been taken into account, as in the present embodiment, if an identifier identifying the unit were not stored in the stack, similarity would be calculated, when updating the keyword frequency table, regardless of whether the keywords were extracted from the same unit or from different units. As a result, it would become impossible to distinguish between a keyword extracted from a different unit and a keyword extracted from the same unit, and the inter-unit share count, which indicates an occurrence of an identical keyword in a different unit, could not be calculated in the frequency correction operation in step e7 hereinafter described. For this reason, an identifier identifying the unit is attached to each keyword stored in the stack, which is a feature different from the embodiment of the first invention. The identifier need only identify one unit from the other, and may be generated such as b1, b2, . . . , during the processing. The processing in step e5 is the same as that performed in step a5 in the embodiment of the first invention, the only difference from step a5 being that the data retrieved in step e5 has an unit identifier attached to it. In step e6, the identifier identifying the unit attached to the data retrieved in step e5 is compared with the contents of a variable Bid. The initial value of the variable Bid is so set as to be equal to any identifier and, at the same time, have the value of the identifier to be compared. For example, an integer larger than 0 is assigned to the unit identifier, and the initial value of Bid is set to -1; in step e6, the comparison is performed by first substituting the identifier to be compared into Bid only when Bid is -1. When they match as the result of the comparison, the process proceeds to step e7; otherwise, the process proceeds to step ell. In step e7, the same processing as that in step a6 in the embodiment of the first invention is performed. In step e8, similarity is calculated between the keyword retrieved in step e7 and each keyword in the keyword item in the keyword frequency table, such as shown in Table 2 in the embodiment of the first invention. In calculating the similarity, a flag variable or the like is used to identify whether an identical keyword (similarity=1) exists in the keyword frequency table so that the existence of any identical keyword can be identified in the subsequent step e9. Next, if the similarity is greater than or equal to 0.5, the intra-unit frequency and inter-unit share count in the keyword frequency table are updated. The following method is used to update the intra-unit frequency and inter-unit share count. The keyword item value of the data retrieved in step e5 is denoted by A, the item value of the frequency of occurrence by F, the keyword item value in the keyword frequency table by B, the intra-unit frequency by Fi, and the inter-unit share count by Fe. When the similarity between keywords A and B is S, the value of the intra-unit frequency in the keyword frequency table is updated to Fi+S.times.F, and the inter-unit share count is updated to Fe+S. For example, consider the case in which the data extracted in step e5 is (Keyword, Similarity)={character pullout} {character pullout}, 2.63) and update operations are performed on the keyword frequency table shown in Table 18 below. Note, however, that the numeric values used in this example are not calculated from actual data, but are given only for explanation purposes.
TABLE 18
Intra-unit Inter-unit
Keyword frequency share count
{character pullout} 2.32 1.87
{character pullout} 1.67 0
{character pullout} 4.15 2.71
{character pullout} 3.34 2.68
: : :
In Table 18, the keywords having a similarity of 0.5 or greater with respect to {character pullout} {character pullout} are {character pullout} {character pullout} and {character pullout} {character pullout}. The similarity between {character pullout} {character pullout} and {character pullout} {character pullout} is 0.67, and the similarity between {character pullout} {character pullout} and {character pullout} {character pullout} is 0.83 (fractions rounded off to two decimal places). As a result, (Keyword, Intra-unit frequency, Inter-unit share count)={{character pullout} {character pullout}, 4.15, 2.71), {character pullout} {character pullout}, 3.34, 2.68)} in the keyword frequency table shown in Table 18 are updated to {{character pullout} {character pullout}, 4.15+0.67.times.2.63, 2.71+0.67), {character pullout} {character pullout}, 3.34+0.83.times.2.63, 2.68+0.83)}. In the present embodiment, of the keywords in the keyword frequency table, only the keywords having a similarity of 0.5 or greater with respect to the data retrieved in step e5 are updated, but other criteria may be used. Update operations are performed on keywords having a similarity of 0.5 or greater in order to reduce the number of update operations. When step e8 is called for the first time after initiating the flowchart of FIG. 17, the keyword frequency table is empty, so that the process proceeds to step e9 without performing any operations in step e8. In step e9, it is judged whether any keyword identical to the keyword retrieved in step e7 has been found in the keyword frequency table during the updating of the keyword frequency table in step e8. If any identical keyword has been found, the process returns to step e5; otherwise, the process proceeds to step e10. In step e10, the data retrieved in step e5 is temporarily stored in a storage area such as a stack. This stack is a different stack from the one used in step e4. In step e11, the data stored in a storage area such as a stack are retrieved one by one and added to the keyword frequency table. The processing performed when adding each data to the keyword frequency table is the same as that performed in step a9 in the embodiment of the first invention. Further, in step e11, the retrieved data is deleted from the storage area. Therefore, each time this step is completed, the stack becomes emptied of all data stored in step e10. In step e12, the identifier identifying the unit of the data retrieved in step e5 is substituted into the variable Bid. In step e13, the same processing as in step e11 is performed. In step e14, the same processing as in step a10 in the embodiment of the first invention is performed, though it is different in that the values of the keyword frequency of occurrence, intra-unit frequency, and inter-unit share count are not integers but real numbers having fractional parts. Next, one embodiment of the sixth invention will be described with reference to drawings. As one embodiment of the sixth invention, a teletext broadcast receiving apparatus equipped with an information abstracting function is shown. FIG. 18 is a diagram showing the system configuration for the embodiment of the sixth invention. Some of the constituent parts in FIG. 18 are the same as those shown in the configurational example of the embodiment of the second invention; therefore, the same constituent parts are designated by the same reference numerals and detailed explanation of such parts will not be given here. In FIG. 18, reference numeral 61 is a teletext similarity score calculating section for calculating similarity between the keywords extracted by the teletext keyword extracting section 23. The hardware configuration for implementing the above-configured system is the same as the hardware configuration of the embodiment of the second invention shown in FIG. 8. Therefore detailed explanation will not be given here. The operation of the thus configured teletext broadcast receiving apparatus equipped with an information abstracting function wherein similarity between keywords is considered will be described below with reference to the flowchart shown in FIG. 19. In steps f1 to f3, the same processing as in steps b1 to b3 in the embodiment of the second invention is performed. Steps f4 to f13 are performed in the teletext similarity score calculating section 61. The processing in steps f4 to f13 is the same as that in steps e4 to e13 performed in the embodiment of the fifth invention, except that the item names of the keyword frequency table are the same as that used in the embodiment of the second invention, and that the same processing as performed on the intra-unit frequency in the fifth invention is performed on the intra-program frequency and the same processing as performed on the inter-unit share | ||||||
