Document processing system deciding apparatus provided with selection functions
4876665

Abstract
A document processing apparatus has a heading word dictionary, a heading extractor, a heading rule dictionary, a heading decision section, a document architecture rule dictionary, and a document architecture decision section, for deciding a logical document architecture. The apparatus further comprises a rule application decision section and a candidate selection indication section that allow the operator to select any desired document architecture when the document architecture decision section decides, in accordance with the document architecture rules, that plural document architecture candidates exist, thus improving the operability of the system. Further, the past rule application record information (priority order) is stored and updated so as to provide a learning function for better operability.

Claims
What is claimed is:

Description
BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Rules for Heading
______________________________________
Condition 1: A reserved word is not included.
Condition 1-1: A heading word is included.
Condition 1-1-1: A reserved heading word is included.
Condition 1-1-1-1: A chapter heading is not included in the previous part.
    (Result) Indicates a chapter heading.
    -> A symbol portion, an alphanumeric portion, a punctuation portion, or a tail symbol is defined as a main heading pattern.
Condition 1-1-2: A reserved heading word is not included.
Condition 1-1-2-2: A chapter heading is present in the previous part.
Condition 1-1-2-2-1: Matching with a chapter heading pattern is successful.
    (Result) Indicates a chapter heading.
    -> The order of the chapter heading pattern is incremented by one.
Condition 1-1-2-2-2: This heading pattern does not match the previous chapter heading.
Condition 1-1-2-2-2-1: An itemized pattern is not present in the previous part.
    (Result) Indicates an itemized pattern candidate.
Condition 1-1-2-2-2-2: An itemized pattern is present in the previous part.
Condition 1-1-2-2-2-2-1: This heading pattern matches the itemized pattern candidate.
    (Result) Indicates an itemized pattern.
    -> The order of the itemized pattern is incremented by one.
______________________________________
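The nested conditions of Table 1 can be read as a decision tree. The following Python sketch is one possible reading, not the patented implementation: the `State` record, the representation of a heading pattern as a tuple of category names, the boolean inputs, and the choice to start each order at 1 are all assumptions made for illustration. Cases that Table 1 does not cover return `None`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical representation: a heading pattern is the sequence of
# category names of the heading symbol, e.g. ("numeric", "punct") for "1.".
Pattern = Tuple[str, ...]


@dataclass
class State:
    """What has been seen in the previous part of the document (assumed)."""
    chapter_pattern: Optional[Pattern] = None
    chapter_order: int = 0
    item_pattern: Optional[Pattern] = None
    item_order: int = 0


def decide_heading(pattern: Pattern, has_heading_word: bool,
                   has_reserved_heading_word: bool, state: State,
                   has_reserved_word: bool = False) -> Optional[str]:
    """Walk the conditions of Table 1; return the (Result) or None."""
    if has_reserved_word or not has_heading_word:
        return None                              # Conditions 1 / 1-1 fail
    if has_reserved_heading_word:                # Condition 1-1-1
        if state.chapter_pattern is None:        # Condition 1-1-1-1
            state.chapter_pattern = pattern      # define main heading pattern
            state.chapter_order = 1              # assumed starting order
            return "chapter heading"
        return None                              # not covered by Table 1
    if state.chapter_pattern is None:            # Condition 1-1-2-2 fails
        return None
    if pattern == state.chapter_pattern:         # Condition 1-1-2-2-1
        state.chapter_order += 1
        return "chapter heading"
    if state.item_pattern is None:               # Condition 1-1-2-2-2-1
        state.item_pattern = pattern
        state.item_order = 1                     # assumed starting order
        return "itemized pattern candidate"
    if pattern == state.item_pattern:            # Condition 1-1-2-2-2-2-1
        state.item_order += 1
        return "itemized pattern"
    return None                                  # not covered by Table 1
```

Called in document order with one shared `State`, a reserved heading word first establishes the chapter pattern, later headings with the same pattern increment the chapter order, and a new pattern becomes the itemized pattern candidate.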
The logical architecture of a sentence or word, as determined by the document architecture decision section 9 in accordance with the above rules, is stored in the logical architecture storage 10. The display controller 4 controls the display 5 to display the document data according to the document logical architecture stored in the logical architecture storage 10.

The operation of the document processing apparatus will now be described with reference to the flow chart shown in FIG. 5. When document data is input to the input device 2 (step a), the input document data is sequentially stored in the original document storage 3. At the same time, the input document data is segmented into a plurality of blocks by the document processor 1, as shown in FIG. 2. In this segmentation processing, line return codes and the like are treated as segmentation codes, and the input document data is segmented into blocks at these codes. The length of each segmented sentence is measured by counting its characters. If the measured value falls within a predetermined value (e.g., 40 characters), the sentence is determined as possibly being a heading sentence.

For a sentence that may be a heading sentence according to its measured number of characters, the heading extractor 6 decides whether the character strings (words, phrases, or symbols) constituting the segmented sentence are registered in the heading word dictionary 7 (step b). For example, when the sentence "1. Introduction" in the input document data is extracted, it is checked whether it is registered in the heading word dictionary 7. In this case, "1", ".", and "Introduction" are retrieved from the heading word dictionary 7, and the sentence is determined as being a heading candidate A (step c).
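Steps a to c can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dictionary contents, the regular-expression tokenizer, and the criterion that every token of the sentence must be registered are hypothetical choices, since the description only says that the constituent character strings are looked up in the heading word dictionary 7.

```python
import re
from typing import List

# Hypothetical contents of the heading word dictionary 7.
HEADING_WORD_DICTIONARY = {"1", "2", ".", "Introduction"}

# Predetermined length limit from the description (e.g., 40 characters).
MAX_HEADING_LENGTH = 40


def segment(document: str) -> List[str]:
    """Segment the document at line return codes, dropping empty blocks."""
    return [line.strip() for line in document.splitlines() if line.strip()]


def tokenize(sentence: str) -> List[str]:
    """Split into words/numbers and single symbols (assumed tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def is_heading_candidate(sentence: str) -> bool:
    """Length check, then dictionary lookup of every token (assumption)."""
    if len(sentence) > MAX_HEADING_LENGTH:
        return False
    tokens = tokenize(sentence)
    return bool(tokens) and all(t in HEADING_WORD_DICTIONARY for t in tokens)
```

With this sketch, "1. Introduction" is tokenized into "1", ".", and "Introduction", all of which are found in the assumed dictionary, so the sentence becomes a heading candidate; a sentence longer than 40 characters is rejected before any lookup.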
When a heading candidate has been extracted, the heading decision section 8 accesses the heading rule dictionary 8a to determine whether the candidate A is a heading word (step d). If the candidate A is defined by any one of the rules shown in FIGS. 4A to 4D, the candidate A is determined as being heading word B (step e). In this case, the type of heading word is determined according to the applied heading rule. If the sentence segmented by the document processor 1 does not correspond to any heading word registered in the dictionary 7, or if the segmented sentence is a heading candidate but does not coincide with any heading rule, the segmented sentence is determined as being a sentence not covered by the heading word rules (step f).

Both the sentences determined as being heading words and the sentences determined as not being heading words are input to the document architecture decision section 9 in order to determine their document architecture. The decision section 9 determines whether the sentence architectures correspond to the document architecture rules (Tables 1 to 4) stored in the rule dictionary 9a (step g). If the architecture of the input document is defined by one of the document architecture rules, the document architecture data corresponding to the determined rule is stored in the storage 10 (steps h and i).

The above method of determining the document architecture will now be described in further detail with reference to the example of segmented sentences shown in FIG. 2. In FIG. 2, the sentence of the first line, i.e., "document understanding system", and the sentence of the second line, i.e., "Okawa Taro", are not stored in the dictionary 7. These sentences are decided by the extractor 6 not to be heading words.
However, the sentence of the first line is defined by a rule representing a noun phrase appearing at the head of the document, and the decision section 9 decides that "document understanding system" is a title. The sentence of the second line, "Okawa Taro", is a proper noun representing a male name. Since the male name follows the title, the name is determined as being an author's name. The results of this document architecture decision are stored in the logical architecture storage 10 in the form shown in FIG. 6A.

In the sentence of the third line, i.e., "1. Introduction", three words, i.e., "1", ".", and "Introduction", are stored in the dictionary 7. Therefore, this sentence is determined as being a heading candidate sentence A1 (see FIG. 7A). At the same time, the categories constituting this sentence are recognized as a numeric portion, a punctuation portion, and a heading candidate word, respectively. The heading decision section 8 accesses the heading rule dictionary 8a to determine whether the heading candidate A1 is defined by the heading rules. In this case, the order of the categories constituting candidate A1 is analyzed, and the decision section 8 determines whether the order satisfies any one of the conditions in FIGS. 4A to 4D. The first numeral "1" is defined by the rule d shown in FIG. 4D. The numeral "1" and the punctuation portion "." are defined by the rule b shown in FIG. 4B. Therefore, "1." is determined as being a heading symbol according to the rule b. "Introduction" is defined by the rule c shown in FIG. 4C, and is determined as being a heading word. The relationship between the heading symbol and the heading word is defined by the rule a shown in FIG. 4A. The heading candidate A1 is thus decided as heading B1. The above decision process is shown in FIG. 7A. If the categories are not defined by the rules a, b, c, and d shown in FIGS. 4A to 4D, heading candidate A1 is determined as not being a heading word.

The document architecture decision section 9 determines the document architecture of heading B1 in accordance with the rules in Tables 1 to 4. At this point, the logical architecture of the sentences analyzed so far is stored in the storage 10, as shown in FIG. 6A, and no chapter heading appears in it. Heading B1, i.e., "1. Introduction", is defined by conditions (1), (1-1), (1-1-1), and (1-1-1-1) in Table 1, so that "1. Introduction" is determined as constituting chapter heading C1, as shown in FIG. 7A. According to this decision, the logical architecture containing the chapter heading is stored in the logical architecture storage 10, as shown in FIG. 6B.

Since the number of characters of the sentence of the fourth and fifth lines shown in FIG. 2 exceeds the number for determining the possibility of a sentence being a heading, this sentence is determined as being other than a heading. As defined by the rule in Table 3, the sentence of the fourth and fifth lines is determined as being a sentence constituting a paragraph.

The sentence of the sixth line, "2. Features of System", is recognized as heading candidate A2 by the same procedure as for heading candidate A1. The sentence of the sixth line is analyzed by the steps in FIG. 7B and is determined as being heading B2. The heading B2 is compared with the rules in Table 2 to determine whether it coincides with a specific one of the rules. The heading B2 is defined by conditions (1-1), (2-1), (3-1), and (4-1), and is determined as possibly being of the same level as chapter heading C1, "1. Introduction". It is then determined whether the heading B2 is defined by the rules in Table 1. "2. Features of System" satisfies conditions (1), (1-1), (1-1-2), and (1-1-2-2-1), and thus the heading B2 is determined as constituting chapter heading C2.
The resultant logical architecture data is stored in the storage 10 as shown in FIG. 6C. The same processing as described above is performed for the sentences of the seventh and subsequent lines, and the document architectures of these sentences are stored in the storage 10, as shown in FIGS. 6D and 6E.

More specifically, for the sentence of the seventh line, heading candidate A3 is analyzed, as shown in FIG. 7C, and is then determined as being heading B3 according to the rules shown in FIGS. 4A to 4D. In the document architecture decision section 9, the heading B3 is compared with the rules in Table 2. Since the pattern of heading B3 has not previously appeared, matching is unsuccessful. As a result, heading B3 is determined as being a heading having a level different from those of the previous headings. Heading B3 is checked in accordance with the document architecture rules in Table 1 and is found to coincide with conditions (1), (1-1), (1-1-2), (1-1-2-2), and (1-1-2-2-2-1). Therefore, heading B3 is determined as being itemized heading C3.

Similarly, since the sentence of the eighth line satisfies conditions (1-1), (2-1), (3-1), and (4-1), the level of the heading corresponding to the sentence of the eighth line is determined as being possibly the same as that of the itemized heading of the seventh line. The sentence of the eighth line is determined as satisfying conditions (1), (1-1), (1-1-2), (1-1-2-2), (1-1-2-2-2), (1-1-2-2-2-2), and (1-1-2-2-2-2-1) in Table 1, and is therefore determined as an itemized heading and stored as shown in FIG. 6D.

The ninth line, "This system is . . . ", can be interpreted in two ways, giving two candidates. In the first case, the ninth line is a part of the eighth-line itemized heading "2 High recognition rate". In the second case, the ninth line is a paragraph having the same level as that of the sixth-line chapter heading "2. Features of System".
Therefore, the apparatus according to the present invention is configured to allow the operator to select any one of the candidates. To achieve this object, the apparatus further comprises a rule application decision section 12 and a candidate selection indication section 14, as depicted in FIG. 1. Whenever two or more candidates are decided, the rule application decision section 12 accesses the document architecture rule dictionary 9a to check the name of the rule requesting candidate selection and retrieves the flags corresponding to that rule name from a table (not shown). The candidate selection indication section 14 responds to a candidate selection key arranged in the document input device 2 and updates the flags so that any desired document architecture can be selected. In FIG. 5, when a decided candidate does not match a single document architecture rule, or when plural candidates are created (in step g), control passes to the rule application decision section 12, which accesses the document architecture rule dictionary 9a. This candidate selection function is the feature of the present invention.

As already explained, the document architecture decision section 9 determines whether the sentence architectures correspond to the document architecture rules (Tables 1 to 4) stored in the document architecture rule dictionary 9a. There are cases where the determined heading candidate word matches a plurality of rules, so that the document architecture cannot be determined uniquely. In such a case, a plurality of architecture candidates are written in the logical architecture storage 10 under the control of the document processor 1, and any one of the candidates is selected by the candidate selection indication section 14 and displayed on the display unit 5 under the control of the display controller 4.
The example of the ninth line shown in FIG. 2 will now be described in further detail. The two architecture candidates are the case where "This system is . . . " is determined as a part of the eighth-line itemized heading " 2 . High recognition rate", and the case where the same ninth-line paragraph is determined as a paragraph having the same level as that of the sixth-line chapter heading "2. Features of System". Since no corresponding rule exists in the heading rule dictionary shown in FIG. 4, the sentence is recognized as a "paragraph" in both cases in accordance with the document architecture rules shown in Table 4. One of the two paragraph candidates is written in the logical architecture storage 10, as shown in FIG. 6E or 6F, in accordance with the flow chart shown in FIG. 9.

More specifically, in this case, condition 1-1 in Table 4 matches, so that control requests the rule application decision section 12 to set applied flag information indicating which of the results d1 and d2 should be executed. In response to this request, the section 12 checks from which rule the request is generated (step 91). In this case, since the request is generated from the rule (d), control retrieves the applied flags corresponding to that rule name from an application rule decision table as shown in FIG. 8 (step 92), and displays the table on the display unit 5 so that the applied flags corresponding to the condition name can be checked (step 93). Here, if the rule 2-1 is to be applied (step 94), the operator sets the flag information X1 to ON (step 95) and the flag information X2 to OFF (step 96). As a result, the condition (2-1) of Table 4 is selected and the result d1 is executed, so that the current "paragraph" of the ninth line is determined as a part of the eighth-line itemized heading and is written in the storage 10 as shown in FIG. 6E.
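The effect of steps 94 to 96 can be illustrated with a small Python sketch. It is only an assumed model of the application rule decision table of FIG. 8: the dictionary layout and the function names `set_applied_flag` and `result_to_execute` are hypothetical, while the condition names 2-1 and 2-2, the flags X1 and X2, and the results d1 and d2 are taken from the description.

```python
from typing import Dict


def set_applied_flag(flags: Dict[str, bool], chosen: str) -> Dict[str, bool]:
    """Return a copy of the flags with only the chosen condition ON."""
    if chosen not in flags:
        raise KeyError(chosen)
    return {name: name == chosen for name in flags}


def result_to_execute(flags: Dict[str, bool], results: Dict[str, str]) -> str:
    """Return the result of the single condition whose applied flag is ON."""
    on = [name for name, is_on in flags.items() if is_on]
    if len(on) != 1:
        raise ValueError("exactly one applied flag must be ON")
    return results[on[0]]


# Condition 2-1 carries flag X1 and result d1; condition 2-2 carries
# flag X2 and result d2 (layout assumed for illustration).
results = {"2-1": "d1", "2-2": "d2"}
flags = set_applied_flag({"2-1": False, "2-2": False}, "2-1")
```

Keeping exactly one flag ON at a time mirrors the way the operator sets X1 to ON and X2 to OFF so that a single result is executed.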
The document architecture determined by the operator as described above can be changed by use of the candidate selection key arranged in the input device 2. That is, if the candidate selection key is depressed in the input device 2, the document processor 1 decodes the contents of the key and sends the information to the candidate selection indication section 14. This section 14 updates the applied flags in such a way that the flag is OFF for condition 2-1 but ON for condition 2-2 in the rule application table stored in the rule application decision section 12. Therefore, in the rule application decision section 12, the applied flag information X2 is set to ON (step 97) and the applied flag information X1 is set to OFF (step 98) so as to apply the condition 2-2. The result d2 is then executed, and the paragraph of the ninth line is determined as a "paragraph" having the same level as that of the sixth-line chapter heading shown in FIG. 2. Accordingly, the document architecture is rewritten in the logical architecture storage 10 as shown in FIG. 6F. In this way, the rule to be applied can be switched by changing the application flag.

In the above description, one of two candidates is selected. However, when several candidates are present, it is possible to control the apparatus in such a way that the application flags are turned on in sequence by repeatedly depressing the candidate selection key. Additionally, in the present invention, rule application record information y1, y2 . . . indicative of past rule application situations is added to the application rule table as shown in FIG. 8.
That is, whenever a rule is selected by the operator, the rule application record information (including the number of times a rule has been applied and the state in which it was applied) is updated, and the selected rules are arranged in the table in descending order of frequency so as to be applicable in that order. Therefore, when plural architecture candidates exist, the rules can be output in descending order of frequency or priority by use of the candidate selection key. This function is referred to as a learning function.

In the above description, only a single architecture rule having plural candidates is selected according to the operator's request. However, it is possible to configure the system so as to create plural document architecture candidates for plural architecture rules. In this case, the applied rule table is formed with a condition name, applied flags, and rule application record information classified for each rule. Further, as a method by which the operator selects another logical architecture through the candidate selection key, a candidate sentence can be indicated by a cursor or by various indications such as display reversal, high luminance, or underlining. Further, the document architecture can be classified according to forms different from those shown in FIG. 6. Further, without being limited to the hierarchical classification of document headings, the present invention can be applied to other data having a formally hierarchical architecture, such as an organization diagram, by modifying the heading decision rules, the document architecture decision rules, and the data related thereto. Furthermore, in the above embodiment, the system handles Japanese documents.
However, without being limited thereto, the present invention can be applied to a system which handles documents in other languages by modifying the heading decision rules and the architecture decision rules so as to correspond to the language.

According to the present invention as described above, the input document data is segmented in units of sentences, and it is determined whether each sentence constitutes a heading. At the same time, the document rule of each sentence is decided. Therefore, the input document data can be processed effectively according to the logical document architectures. In other words, the heading, the level (hierarchical level) of the heading in the document, and the document architecture, such as the paragraphs preceded by the heading, are obtained from the input document data. It is thus possible to process the document in units of chapters or sections according to the logical architectures, thereby greatly improving and simplifying document processing.

In addition, since the system according to the present invention can display architecture candidates and the operator can select any one of them, it is unnecessary for the operator to amend the document architecture by using the editing functions. Further, since the frequency information is added to each of the selected architecture candidates to provide a learning function, the desired document architecture is more likely to be decided. Further, when a higher priority value is set in advance in the rule application record information of the rule to be applied, the most desired architecture candidate can be decided first, in particular where the document has a specific document architecture, as in a treatise.
Therefore, the system according to the present invention can efficiently form a document architecture and therefore can enhance the document forming efficiency.
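The learning function and the sequential use of the candidate selection key described earlier can be sketched together in Python. This is a hedged illustration rather than the patented implementation: the class and method names, the separation between moving the flag with the key and confirming a selection, and the use of a plain application count as the record information are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleEntry:
    condition: str          # condition name, e.g. "2-1"
    applied: bool = False   # applied flag (X1, X2, ...)
    times_applied: int = 0  # rule application record (y1, y2, ...), assumed to be a count


class ApplicationRuleTable:
    """Assumed model of the application rule table of FIG. 8."""

    def __init__(self, conditions: List[str]) -> None:
        self.entries = [RuleEntry(c) for c in conditions]

    def _turn_on(self, index: int) -> str:
        for i, entry in enumerate(self.entries):
            entry.applied = (i == index)
        return self.entries[index].condition

    def offer_first(self) -> str:
        """Turn on the flag of the highest-priority candidate."""
        return self._turn_on(0)

    def press_candidate_key(self) -> str:
        """Turn on the flag of the next candidate, wrapping around."""
        on = [i for i, e in enumerate(self.entries) if e.applied]
        nxt = (on[0] + 1) % len(self.entries) if on else 0
        return self._turn_on(nxt)

    def confirm(self) -> None:
        """Record the selected rule and reorder by descending frequency."""
        for entry in self.entries:
            if entry.applied:
                entry.times_applied += 1
        # list.sort is stable, so ties keep their previous order.
        self.entries.sort(key=lambda e: e.times_applied, reverse=True)
```

In this sketch, a rule the operator has confirmed more often moves toward the top of the table, so `offer_first` proposes it before the others the next time the same ambiguity arises, and repeated key presses walk through the remaining candidates in priority order.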
