Document processing using heading rules storage and retrieval system for generating documents with hierarchical logical architectures

4813010

Abstract

A document generating system includes an input device for inputting document data and a heading candidate extraction section for extracting, as a heading candidate, a word in the input document data that corresponds to a heading stored in a heading dictionary. A heading decision section checks the heading candidate extracted by the heading candidate extraction section against a heading rule stored in a heading rule dictionary and decides whether the heading candidate is a heading. A document architecture decision section checks the heading decided by the heading decision section against document architecture rules stored in a document architecture rule dictionary. It determines a heading that satisfies the document architecture rules to be a true heading, and a heading that does not satisfy them to be a false heading.

Claims

What is claimed is:

Description

BACKGROUND OF THE INVENTION
TABLE 1
________________________________________________________________
Rules for Heading
________________________________________________________________
Condition 1                 does not include a reserved word.
Condition 1-1               includes a heading word.
Condition 1-1-1             includes a reserved heading word.
Condition 1-1-1-1           does not include a chapter heading in
                            the subsequent part.
(Result)                    indicates a chapter heading.
                            → A symbol portion, an alphanumeric
                            portion, a punctuation portion, or a
                            tail symbol is defined as a heading
                            pattern.
Condition 1 +               does not include a reserved heading
Condition 1-1,              word.
Condition 1-1-2
Condition 1-1-2-2           indicates that a chapter heading is
                            present in the subsequent part.
Condition 1-1-2-2-1         indicates that matching with a chapter
                            heading pattern is successful.
(Result)                    indicates a chapter heading.
                            → The order of the chapter heading
                            pattern is incremented by one.
Condition 1 +               this heading pattern does not match
Condition 1-1 +             the previous chapter heading.
Condition 1-1-2 +
Condition 1-1-2-2,
Condition 1-1-2-2-2
Condition 1-1-2-2           an itemized pattern is not present in
                            the subsequent part.
(Result)                    indicates an itemized pattern
                            candidate.
Condition 1 +               indicates that an itemized pattern
Condition 1-1 +             candidate is present in the subsequent
Condition 1-1-2 +           part.
Condition 1-1-2-2 +
Condition 1-1-2-2-2,
Condition 1-1-2-2-2-2
Condition 1-1-2-2-2-2-1     this heading pattern matches the
                            previous itemized pattern candidate.
(Result)                    indicates an itemized pattern.
                            The order of the itemized pattern is
                            incremented by one.
________________________________________________________________
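The nested conditions of Table 1 can be read as a small decision tree. The following Python sketch is one possible reading, not the patented implementation. The function name and all parameter names are hypothetical, and the sketch assumes that "present in the subsequent part" means the heading or candidate has already been recorded when the current sentence is examined, which is how the walk-through in the description uses the table.

```python
from typing import Optional


def classify_heading(
    *,
    has_reserved_word: bool = False,
    has_heading_word: bool = False,
    has_reserved_heading_word: bool = False,
    chapter_heading_seen: bool = False,
    matches_chapter_pattern: bool = False,
    itemized_candidate_seen: bool = False,
    matches_itemized_pattern: bool = False,
) -> Optional[str]:
    """Walk the nested conditions of Table 1; None means no rule applies."""
    # Condition 1 (no reserved word) and Condition 1-1 (has a heading word).
    if has_reserved_word or not has_heading_word:
        return None
    # Condition 1-1-1: a reserved heading word is present.
    if has_reserved_heading_word:
        # Condition 1-1-1-1: no chapter heading has been recorded yet.
        return None if chapter_heading_seen else "chapter heading"
    # Condition 1-1-2 with 1-1-2-2: no reserved heading word, and a
    # chapter heading has already been recorded.
    if not chapter_heading_seen:
        return None
    # Condition 1-1-2-2-1: the pattern matches the chapter heading pattern.
    if matches_chapter_pattern:
        return "chapter heading"
    # Condition 1-1-2-2-2: the pattern differs from the chapter heading.
    if not itemized_candidate_seen:
        return "itemized pattern candidate"
    # Condition 1-1-2-2-2-2-1: the pattern matches the itemized candidate.
    if matches_itemized_pattern:
        return "itemized pattern"
    return None


# First heading with a reserved heading word, e.g. "1. Introduction".
first = classify_heading(has_heading_word=True, has_reserved_heading_word=True)
# Later heading whose pattern matches the chapter pattern.
second = classify_heading(
    has_heading_word=True, chapter_heading_seen=True, matches_chapter_pattern=True
)
# Heading with a new pattern after a chapter heading exists.
third = classify_heading(has_heading_word=True, chapter_heading_seen=True)
```

The order increments listed in the (Result) rows are left out of the sketch; a caller would keep one counter per heading pattern.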
The logical architecture of a sentence or word, as determined by decision section 9 according to the above rules, is stored in logical architecture storage 10. Display controller 4 controls display 5 to display the document data according to the document logical architecture stored in logical architecture storage 10.

The operation of the document processing apparatus will now be described with reference to the flow chart in FIG. 5. When document data is input at input device 2 (step a), the input document data is sequentially stored in document storage 3. At the same time, the input document data is segmented into a plurality of blocks by document processor 1, as shown in FIG. 2. In segmentation processing, a line return code and a space code or a segmentation symbol such as ". . .", ";", ",", or ":" are treated as segmentation codes. The input document data is segmented into blocks at the segmentation codes. The length of each segmented sentence is then measured by counting its characters. If the measured value falls within a predetermined value (e.g., 40 characters), the sentence is determined as possibly being a heading sentence.

If the segmented sentence is determined as possibly being a heading sentence according to the measured number of characters, heading extractor 6 decides whether a character string (words, phrases, or symbols) constituting the segmented sentence is registered in heading dictionary 7 (step b). For example, when the sentence "1. Introduction" in the input document data is extracted, it is checked whether it is registered in heading dictionary 7. In this case, "1" and "Introduction" are retrieved from heading dictionary 7, and the sentence is determined as being heading candidate A (step c). Once the heading candidate has been decided, heading decision section 8 accesses heading rule dictionary 8a to determine whether candidate A is a heading word (step d). 
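Steps a to c above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: blocks are split at line returns only, a block counts as a candidate when at least one of its tokens is registered, and the dictionary contents and all function names are hypothetical stand-ins for heading dictionary 7 and heading extractor 6.

```python
import re

# Hypothetical stand-in for heading dictionary 7; the entries are assumptions.
HEADING_DICTIONARY = {"1", "2", ".", "Introduction"}
MAX_HEADING_CHARS = 40  # example threshold given in the text


def segment(document: str) -> list[str]:
    """Split the document into blocks at line return codes (simplified)."""
    return [line.strip() for line in document.splitlines() if line.strip()]


def tokenize(block: str) -> list[str]:
    """Break a block into words, numbers, and single symbols."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", block)


def is_heading_candidate(block: str) -> bool:
    """A short block with at least one registered string is a candidate."""
    if len(block) > MAX_HEADING_CHARS:
        return False
    return any(token in HEADING_DICTIONARY for token in tokenize(block))


document = (
    "Document understanding system\n"
    "Okawa Taro\n"
    "1. Introduction\n"
    "This paragraph is far longer than forty characters, so it is never a heading.\n"
)
candidates = [block for block in segment(document) if is_heading_candidate(block)]
```

With these sample lines, only "1. Introduction" passes both the length check and the dictionary lookup; the title and author lines are left for the document architecture rules.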
If candidate A is defined by any one of the rules in FIGS. 4A to 4D, candidate A is determined as being heading word B (step e). In this case, the type of heading word is determined according to the applied heading rule. If the sentence segmented by document processor 1 does not correspond to any heading word registered in dictionary 7, or if the segmented sentence is determined as being a heading candidate but does not satisfy any heading rule, the segmented sentence is determined as being a sentence not covered by the heading word rules (step f).

The sentences determined as being heading words and the sentences determined as not being heading words are input to document architecture decision section 9 in order to determine their document architecture. To do so, decision section 9 determines whether the sentence architectures correspond to the document architecture rules (Tables 1 to 4) stored in rule dictionary 9a (step g). If the architecture of the input document is defined by one of the document architecture rules, document architecture data corresponding to the determined rule is stored in storage 10 (steps h and i). If not, an error is present in the document architecture, and error processing for the sentence, for example, correction of the document data, is performed (step j).

In the segmented words in FIG. 2, the sentence of the first line, i.e., "document understanding system", and the sentence of the second line, i.e., "Okawa Taro", are not stored in dictionary 7. These sentences are decided by extractor 6 not to be heading words. However, the sentence of the first line is defined by a rule representing a noun phrase appearing at the head of the document, and decision section 9 decides that "document understanding system" is a title. The sentence of the second line, "Okawa Taro", is a proper noun representing a person's name. 
Since the person's name follows the title, the name is determined as being an author's name. The results obtained by the document architecture decision described above are stored in logical architecture storage 10 in the form shown in FIG. 6A.

In the sentence of the third line, i.e., "1. Introduction", three words, i.e., "1", ".", and "Introduction", are stored in dictionary 7. Therefore, this sentence is determined as being heading candidate sentence A1. At the same time, the categories constituting this sentence are recognized as a numeric portion, a tail portion, and a heading candidate word, respectively. Heading decision section 8 accesses rule dictionary 8a to determine whether the sentence determined as being heading candidate A1 is defined by the heading rules. In this case, the order of the categories constituting candidate word A1 is analyzed. Decision section 8 determines whether the order satisfies any one of the conditions in FIGS. 4A to 4D. The first numeral "1" is defined by the rule in FIG. 4D. The numeral "1" and tail portion "." are defined by the rule in FIG. 4B. Therefore, "1." is determined as being a heading symbol according to the rule in FIG. 4B. "Introduction" is defined by the rule in FIG. 4C, and is determined as being a heading word. The relationship between the heading symbol and the heading word is defined by the rule in FIG. 4A. Heading candidate A1 is thus decided as heading B1. The above decision process is shown in FIG. 7A. In the above decision process, if the categories are not defined by the rules in FIGS. 4A to 4D, heading candidate A1 is determined as not being a heading word.

Document architecture decision section 9 determines the document architecture of heading B1 according to the rules in Tables 1 to 4. In this case, the logical architecture of the analyzed sentence is stored in storage 10, as shown in FIG. 6B. No chapter heading is indicated in the stored logical architectures. Heading B1, i.e., "1. 
Introduction" is defined by conditions (1), (1-1), (1-1-1), and (1-1-1-1) in Table 1, so that "1. Introduction" is determined as constituting chapter heading C1. According to this decision, the logical architecture containing the chapter heading is stored in logical architecture storage 10, as shown in FIG. 6B.

Since the number of characters in the sentence of the fourth and fifth lines exceeds the limit for a sentence to be a possible heading, this sentence is determined as being other than a heading. As defined by the rule in Table 3, the sentence of the fourth and fifth lines is determined as being a sentence constituting a paragraph.

The sentence of the sixth line is recognized as heading candidate A2 by the same procedure as for heading candidate A1. In this case, the sentence of the sixth line is analyzed by the steps in FIG. 7B and is determined as being heading B2. The chapter heading data in FIG. 6B is stored in storage 10. Heading B2 is compared with the rules in Table 2 to determine whether it coincides with a specific one of the rules. Heading B2 is defined by conditions (1-1), (2-1), (3-1), and (4-1), and is determined as possibly being of the same level as chapter heading C1. It is then determined whether heading B2 is defined by the rules in Table 1. In other words, "2. Features of System" satisfies conditions (1), (1-1), (1-1-2), and (1-1-2-2-1), and thus heading word B2 is determined as constituting chapter heading C2. The resultant logical architecture data is stored in storage 10 as shown in FIG. 6C.

The same processing as described above is performed for the sentences of the seventh and subsequent lines, and the document architectures of these sentences are stored in storage 10, as shown in FIGS. 6D and 6E. More specifically, for the sentence of the seventh line, heading candidate A3 is analyzed, as shown in FIG. 
7C, and then is determined as being heading B3 according to the rules in FIGS. 4A to 4D. In document architecture decision section 9, heading B3 is compared with the rules in Table 2. Since the pattern of heading B3 has not previously appeared, matching is unsuccessful. As a result, heading B3 is determined as being a heading having a level different from those of the previous headings. Heading B3 is checked according to the document architecture rules in Table 1 and is found to coincide with conditions (1), (1-1), (1-1-2), (1-1-2-2), and (1-1-2-2-2-1). Therefore, heading B3 is determined as being itemized heading C3. Similarly, since the sentence of the eighth line satisfies conditions (1-1), (2-1), (3-1), and (4-1), the level of the heading corresponding to the sentence of the eighth line is determined as possibly being the same as that of the itemized heading. The sentence of the eighth line is determined as satisfying conditions (1), (1-1), (1-1-2), (1-1-2-2), (1-1-2-2-2), (1-1-2-2-2-2), and (1-1-2-2-2-2-1) in Table 1.

If a paragraph is detected, the correspondence between the paragraph and the level of the heading must be determined. By referring to the rule in Table 4, the conjunctional relationship between the paragraph and the heading is determined so that the paragraph level can be decided. When the sentence of the fourth line is input, the sentence immediately preceding the paragraph is a chapter heading and satisfies conditions (1-1) and (2-1) in Table 4. The level of this sentence is set to be the same (level 1) as the chapter heading, as shown in FIG. 6C. For the sentence of the ninth line, a paragraph is found next to the itemized heading. The sentence of the ninth line satisfies conditions (1-1) and (2-2) in Table 4. Therefore, its level is set as shown in FIG. 6E.

According to the present invention as described above, the input document data is segmented in units of sentences, and it is determined whether each sentence constitutes a heading. 
At the same time, the document rule of each sentence is decided. Therefore, the input document data can be effectively processed according to the logical document architectures. In other words, the heading, the level (hierarchical level) of the heading in the document, and the document architecture, such as the paragraphs preceded by the heading, are obtained effectively from the input document data. It is thus possible to process the document in units of chapters or sections according to the logical architectures, thereby greatly improving and simplifying document processing.

FIG. 8 shows a block diagram of heading extractor 6. In this circuit, document data is sequentially input from original document storage 3 to comparator 81 in response to a readout position signal from readout position designation section 80. In this case, the position of each character in the document is stored in address memory 82. Comparator 81 compares input sentence data with a line return code stored in register 83. The input sentence data is also input to and stored in line buffer 84 in response to a position signal from readout position control section 80. The number of characters stored in line buffer 84 is counted by counter 85 in response to the signal from control section 80. If a line return code appears in the character data, the count of counter 85 is compared, by comparator 87, with a value stored in register 86. Data representing a predetermined one-line sentence length, for example, 40 characters, is stored in register 86. If the count of counter 85 coincides with the reference value of register 86, an output from comparator 87 is input to decision section 88. 
If the output from comparator 87 is supplied together with the line return code from comparator 81 to decision section 88, or the line return code is supplied thereto prior to the output from comparator 87, decision section 88 determines the input sentence data as being a heading candidate. Decision section 88 supplies a signal based on this decision to line buffer 84. Line buffer 84 stores the input sentence data as heading candidate data in heading candidate data memory 89, in response to the decision signal from decision section 88. In this case, the count of counter 85 is subtracted by operation section 23 from the content of address memory 82. The resultant difference gives a start address of line buffer 84. The start address is input to heading candidate data start position memory 91. Line buffer 84 and counter 85 are then reset to the initial state. Therefore, heading extractor 6 is set to the initial state so as to extract the next sentence.

If a coincidence between the count of counter 85 and the reference value of register 86 occurs prior to generation of the line return code, the data stored in line buffer 84 is determined as not being a heading word. Line buffer 84 and counter 85 are reset to the initial state. In this state, the extraction operation for the next input sentence data can be performed.

When the document data has been completely processed by the heading extraction routine, the heading decision routine is executed by heading decision section 8. This routine is performed by the circuit in FIG. 9. The heading candidate data from heading candidate data memory 89 is input to comparator 92 and is sequentially compared with the decision rules stored in heading rule dictionary 8a. The comparison results of comparator 92 are sequentially input to decision section 93. Decision section 93 determines conditions for defining heading candidate data as true heading words. 
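Returning briefly to the FIG. 8 extractor, its counter-and-comparator behavior can be simulated in software. The Python sketch below is an illustration under assumptions, not the circuit itself: the function name is hypothetical, the character index stands in for address memory 82, a list stands in for line buffer 84 and counter 85, and text after the last line return is ignored.

```python
LINE_RETURN = "\n"   # stands in for the line return code in register 83
MAX_CHARS = 40       # stands in for the reference value in register 86


def extract_candidates(text: str, max_chars: int = MAX_CHARS) -> list[tuple[int, str]]:
    """Return (start_address, line) pairs for lines of at most max_chars."""
    candidates = []
    buffer = []          # plays the role of line buffer 84; its length is the count
    too_long = False     # set when the limit is hit before a line return
    for address, char in enumerate(text):
        if char == LINE_RETURN:
            if buffer and not too_long:
                # Start address = current address minus the character count.
                start = address - len(buffer)
                candidates.append((start, "".join(buffer)))
            # Reset for the next sentence in either case.
            buffer = []
            too_long = False
        elif not too_long:
            if len(buffer) == max_chars:
                # Limit reached before the line return: not a heading.
                too_long = True
                buffer = []
            else:
                buffer.append(char)
    return candidates


text = "1. Introduction\n" + "x" * 50 + "\n2. Features of System\n"
found = extract_candidates(text)
```

Each returned start address points back into the original text, which corresponds to what heading candidate data start position memory 91 holds in the circuit.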
If decision section 93 determines that heading candidate data satisfies all conditions, heading rule application control section 94 determines that the candidate data is a true heading word. By this decision, heading candidate data memory 89 reads out the candidate word data decided as the true heading word and supplies it to heading data cell-shaping section 96. Shaping section 96 forms a heading cell for every heading decision. If numerals or letters are determined as being a heading, the numerals are ordered according to their values, or the letters are ordered alphabetically. For example, if a heading symbol is "Chapter 2", its order is 2. The data cell obtained in the manner described above is stored in heading data cell memory 97.

According to the present invention, the logical architectures such as chapters, sections, items, and paragraphs are obtained from the input document data. Therefore, document processing can be performed effectively according to these logical architectures.
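The ordering of heading symbols described for shaping section 96 can be illustrated with a short Python sketch. The function name and the dictionary layout of the heading cell are hypothetical; the text only states that numerals are ordered by value and letters alphabetically.

```python
import re
from typing import Optional


def heading_order(symbol: str) -> Optional[int]:
    """Return the order of a heading symbol, or None if it has none."""
    number = re.search(r"\d+", symbol)
    if number:
        # Numerals are ordered according to their values.
        return int(number.group())
    letter = re.search(r"\b([A-Za-z])\b", symbol)
    if letter:
        # A standalone letter is ordered alphabetically: a = 1, b = 2, ...
        return ord(letter.group(1).lower()) - ord("a") + 1
    return None


def make_heading_cell(symbol: str, word: str) -> dict:
    """Build a hypothetical heading cell holding the symbol, word, and order."""
    return {"symbol": symbol, "word": word, "order": heading_order(symbol)}


cell = make_heading_cell("Chapter 2", "Features of System")
```

Here `cell["order"]` is 2, matching the "Chapter 2" example in the text.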
