Method and system for analyzing the logical structure of a document5669007Abstract An input document is matched with predetermined patterns on a line-by-line basis, whereby it can be assigned a plurality of pairs of attributes and costs. When the process for the whole document is completed, in accordance with a rule specifying the combination of attributes between the adjacent lines, the nodes of a graph are generated, the nodes are linked with each other, and costs are given to the node and links. There is a plurality of paths for traveling the graph from the root node to the final node, and each of them means the interpretation of a possible logical structure of the document. By summing the costs for the traveled nodes and links, a total cost value can be associated with each path, and by prioritizing by this total cost value, a plurality of logical structure interpretations can be sequentially shown from the most plausible path (logical structure interpretation). A chosen logical structure is tagged as required. Claims I claim: Description FIELD OF THE INVENTION
______________________________________
*Chapter[0-9]+
(HEADER 1 CS3)
*[0-9)+. [0-9]+
(HEADER 2 CS3) (OLIST 2 CS5)
. (ULIST ? CS3)
.sctn. *[0-9]+ *[ 0-9] (HEADER 1 CS3)
.sctn. *[0-9]+ *. [0-9)+ *[ 0-9]
(HEADER 2 CS3)
* ( *[0-9]+ *) (OLIST 1 CS3)
* ( *[0-9]+. [-9]+ *) (OLIST 2 CS3)
*[0-9]+ *) (OLIST 1 CS3)
*[0-9]+. [ 0-9]
(HEADER 1 CS3) (OLIST 1 CS5)
*[0-9]+. [0-9]+. [0-9]+. ? +
(HEADER 3 CS3)
(OLIST 3 CS5)
*[0-9]+-[0-9]+ +[ -]
(HEADER 2 CS5)
(OLIST 2 CS3)
*[0-9]+-[0-9]+-[0-9]+ +[ -]
(HEADER 3 CS5) (OLIST 3 CS3)
*[A-z]+. *[ A-z]
(HEADER 1 CS7) (OLIST 1 CS3)
* ( *[A-z] *)
(OLIST 1 CS3)
*[A-z] *) (OLIST 1 CS3)
* ( *[-] *) (OLIST 1 CS3)
*[-] *) (OLIST 1 CS3)
*. (ULIST ? CS3)
*. (ULIST ? CS3)
*- (ULIST ? CS3)
______________________________________
EXAMPLE 1 Dictionary Entries These indicate that, if there is a character string that matches the regular expression of interest in a line, that line has any of the sets (of attribute level cost) (called labels) listed on the right side. In the regular expression in example 1, : head of the line, *: repetition more than 0 times of the preceding character, [-]: a character whose character code lies between the characters specified on both sides of--(including both characters), and +: repetition more than one time of the preceding character. Accordingly, for instance, *Chapter[0-9] indicates that there are 0 or more blanks after the beginning of the line, followed by the characters "Chapter" and one or more characters whose character code lies between 0 and 9 (including 0 and 9), namely numerals. In the set (attribute level cost) of example 1, for instance, the following attributes are shown: HEADER: header OLIST: the first line of an ordered list item ULIST: the first line of an unordered list item Further, although not shown in example 1, there are other attributes, such as those that follow: LCONT: continuation line of a list item PBEGIN: the first line of a paragraph PEND: the last line of a paragraph ORDINARY: ordinary line (not a header, nor the first or last of a paragraph, nor part of a list) Level is a numeral which shows the depth of the paragraph and chapter for HEADER, and the depth of the list for others. Sometimes, depth may be undefined (it can not be determined from the character string and it is determined from the context in a later stage), and, in this case, the level of the dictionary entry is set to "?". Cost is represented by a character string beginning with CS, and the numeral following CS is the value of the cost. Particularly regarding *[0-9]+. [0-9]* (HEADER 2 CS3) (OLIST 2 CS5), it should be noted that two labels, "HEADER 2 CS3)" and "(OLIST 2 CS5)", are assigned there. If the assignment of more than one labels is allowed, ambiguity occurs in the logical structure interpretation, but the present invention is characterized by positively taking advantage of such ambiguity to rank and display more than one possible candidates of the logical structure of a document in descending order. In this case, the assigned cost gives an index which numerically represents the plausibility of the logical structure obtained through analysis. That is, the greater the assigned cost is, the lower the validity or plausibility is. Thus, if the two labels of "(HEADER 2 CS3)" and "(OLIST 2 CS5)" are assigned, it means that the system determines that the validity of being a header is greater than the validity of being the first line of a list item. In addition, the value of a cost is determined statistically or manually by examining a number of sample sentences beforehand. The first line in example 1 means that the line beginning with a character string such as "Chapter 1" or "Chapter 12" is a header of level 1 and the cost for that is 3. Further, the second line means that the line is a header of two levels or the beginning of a list item (ordered) if there is "1.2" or "3.3," and the cost for them is 3 and 5. Similarly, from the description in the third line, .cndot. is the first candidate of list items (unordered), and the level is indefinite, in other words, it can occur in any level with a cost of 3. A dictionary retrieval program 412 reads a dictionary file 416 into main memory and converts it to a finite-state automaton, and thereafter, by matching the character string in the four elements (line) of the line data 414 from the beginning on a character-by-character basis, it retrieves the regular expression in the character string. (This retrieval method is a standard one for the matching of a regular expression and character string.) If the character string contains a substring corresponding to the regular expression, the label corresponding to the regular expression (the label written to the right of the regular expression in the dictionary) is assigned to that line. At this stage, more than one labels can be attached to one line. Which of the labels is valid may depend on other characteristics of the line, for instance, Indentation (e.g.: the first line of a paragraph is indented, etc.), font size (e.g.: the header is printed in large characters, etc.) or the like, or some combination of them, or may depend on the combination with the characteristics of the preceding and next lines. These are solved in the later stages of graph creation and analysis. In addition, the elements which are undefined at the stage of consulting the dictionary, such as the level of an unordered list, are also determined later on. For instance, the following sentence is considered:
Table 1
______________________________________
Left Line
Character string end length Font
______________________________________
Chapter 1 Document Image Analysis
0 36 2 (1)
Information extracted in the document
2 60 1 (2)
image analysis . . .
1.1 Text 4 9 1 (3)
Character code only 10 19 1 (4)
1.2 Layout information 4 23 1 (5)
Extracted from printed image
10 28 1 (6)
1.2 and 1.3 are different
10 28 1 (7)
1.3 Logical structure 4 22 1 (8)
Conversion into tags of structure
10 56 1 (9)
description . . .
. GML 12 4 0.8 (10)
. Latex 12 7 0.7 (11)
is processed. 10 13 1 (12)
Many systems were devised to extract this
4 55 1 (13)
information, for instance, the following;
4 28 1 (14)
Defaultindent = 4
Defaultlength = 60
Defaultsize = 1
______________________________________
EXAMPLE 2 Example of Text In the above example, Defaultindent represents the default left end, Defaultlength represents the default line length (right end minus left end), and Defaultsize represents the default font size. Based on with the dictionary in example 1, the following possibility exists:
TABLE 2
______________________________________
(1) : (HEADER 1 CS3)
(3),(5),(7),(8)
: (HEADER 2 CS3) or (OLIST 2 CS5)
(10),(11) : (ULIST ? CS3)
______________________________________
However, on the one hand, (3), (5) and (8) are logically presumed to be the beginning of a list item in view of the fact that: the indent is larger than the text, and a pattern of the same type which has a continuous number appears at a relatively near position (*). On the other hand, (7) happened to be of the same type as (3), (5) and (8), but it should be part of a list item beginning from (5), or the continuation of a list. The reason for this is that the indent of (7) is not aligned with the preceding (3) and (5) regardless of being of the same type as they are, and thus it is not likely to be the beginning of a list item of the same level. A list may be nested but, in that case, the number should start from one. Even if it is assumed to be the header of level 2, it is not logical that the number does not start from one. Accordingly, it is inappropriate that this line is determined to be the starting line of a list item or a header. Further, since (10) and (11) are items in a list item starting from (8), its level should be 2. In addition, (4), (6), (9), and (12) are continuation lines of a list, but discrimination between these and ordinary lines cannot be determined only from the lines themselves. In addition, (12) is at the same level as (9) in the above figure, if the indent is as shown in:
TABLE 3
______________________________________
.cndot.
LaTeX (11')
is processed.
(12')
______________________________________
it should be determined that (12') is at the same level as (11'), and for
TABLE 4
______________________________________
.cndot.
LaTex (11")
(macro of TeX)
(12")
______________________________________
The indent is the same as (11) and (12), but it should be determined from the content that (12") is at the same level as (11"). Such determinations cannot be made if various characteristics of the lines appearing before and after the line of interest such as an attribute, length, indent, and whether or not it is a block boundary are not considered. It is difficult to determine uniquely from any single characteristic, and no determination may be performed without sophisticated processing related to the content of the document, as in (11") and (12"). In addition, various styles of printed documents exist and, if it is desired to deal with a wide range of documents, one-to-one correspondence cannot be provided between characteristics and attributes. Further, determination based on a ambiguous determination criterion such as shown by (*) above may be needed. As described above, to determine the plausibility of each candidate of attributes, the relationships between the attributes of the preceding and next lines must be considered. In addition, the validity is not definitely determined but has ambiguity. To order the possible interpretations according to their plausibility, a document is represented in the graph creating section by a directed graph whose nodes are the respective ones of attributes of each line and whose links are the adjacency of lines. The respective paths reaching from the starting point to the ending point of the graph are the respective possibilities of the logical structures of a document. In FIG. 4, a graph creating section 308 is a processing module for creating an analysis graph 424 and consists of a grammar file 420 stored on the hard disk 404 and a grammar interpreter 418 resident processes in main memory which interpretes the grammar stored in file 420 and the graph 424. The input to the graph creating section 308 is a set of <line, attribute label>. Here, line is a line in line data 414, and attribute label is the one determined in the format analyzing section 306 from a keyword in the line. When a line does not contain any of the regular expressions given in the dictionary, a label of (ORDINARY ? CS3) is assigned. The graph 424 is a directed graph in which a cost is added to each node and link and it is stored on the hard disk 404 in a predetermined data structure. For the link, a cost is attached determined by the relationship between the characteristics of the line of interest and the preceding and next lines thereof. The cost attached to the node is provided by the characteristics of the line of interest and the connection relationship thereof. (For instance, if a Header label is attached, the cost (CS3) described in the dictionary is used if no other condition is specifically given but, if the right margin is smaller than a predetermined value, the possibility of being text becomes high and thus the cost is increased.) In addition, any label which cannot exist logically at the point of the interlabel connection relationship is excluded at this stage. Further, as in (7), described above, considering the possibility of a line (ordinary) which by chance has the pattern to become a keyword, also for a line labeled in the dictionary, a node labeled an ordinary line is added if it satisfies a predetermined condition. The connection condition as described above or the cost adjustment for the connection is described in the file 420 on the hard disk 404, using a type 3 grammar as shown in example 3, below. EXAMPLE 3
______________________________________
[Numeric expression 1]
______________________________________
HEADER: START
PEND
if(fontsize > Defaultsize) Cost-;
if(indent > Defaultindent) Cost+;
if(rightmargin<Threshold) Cost+;
ADD(HEADER,Level,N,Cost,2);
if(period) ADD(ORDINARY,?,3,2);
}
OLIST: ORDINARY
PBEGIN
{
if(fontsize > Defaultsize) Cost+;
if(indent > Defaultindent) Cost-;
ADD(OLIST,Level,cost,2);
ADD(ORDINARY,?,6,2)
}
OLIST: OLIST
ULIST
LCONT
{
if(fontsize > Defaultsize) Cost+;
if(indent > Defaultindent) Cost-;
ADD(OLIST,Level,Cost,2);
ADD(ORDINARY,?,6,3)
ADD(CONT,?,6,2);
}
ULIST: ORDINARY
PBEGIN
LCONT
{
if(fontsize > Defaultsize) Cost+;
if(indent > Defaultindent) Cost-;
ADD(ULIST,Level,Cost,2);
}
ORDINARY: PEND
HEADER
{
if(font>Defaultsize & indent<Defaultindent)
ADD(HEADER,?,5,2);
else if(indent>Defaultindent) ADD(PBEGIN,?,3,2);
else ADD(PBEGIN,?,5,2);
}
ORDINARY: PBEGIN
ORDINARY
{
ADD(ORDINARY,?,3,3);
if(period)
if(length<Defaultlength)ADD(PEND,?,2, 2);
else ADD(PEND, ?, 3, 2);
}
ORDINARY: OLIST
ULIST
LCONT
{
ADD(ORDINARY,?,3,3);
ADD(LCONT,?, 3,2)
}
END: PEND
{
ADD(END,?,0,0)
}
In the above example 3, the expression
P: Q1
Q2
Q3
{
}
______________________________________
indicates that a node having an attribute P is connected to a node having attributes of Q1, Q2, or Q3 and the operation in {} is performed. Cost represents the cost described in the dictionary, Level represents a level described in the dictionary, and Threshold represents a constant representing a threshold value. In example 3, the statement Cost+ or Cost- indicates that the value of Cost is incremented or decremented by a predetermined unit. As the predetermined unit, 2 is set in this example, but 1 or another value may be selected as needed. ADD(class,level,node.sub.-- cost,link.sub.-- cost) indicates that a node having an attribute level of class and a cost of node cost is connected, and this cost is assigned to a link of link.sub.-- cost. Thus, the resultant graph has costs for both node and link. There should be one or more ADD sentences in {} of the rule. The first rule of example 3 above indicates that the HEADER line can appear next to the last line of a paragraph (that is, a link can be spanned from a node having the last attribute of the paragraph to a node having the HEADER attribute), and, in this step, if that line is printed with a font larger than the default size, the cost is decreased (it is determined that the possibility of being a HEADER is high), and, if that line is indented or its right margin is smaller than a fixed value, the cost is increased, and, if there is a period at the end of the line, it has the attribute of an ordinary line and a node of an indefinite level and cost 3 is also added. The second rule indicates that the OLIST line can appear next to an ordinary line or the first line of a paragraph, and, in this step, unlike for the HEADER line, the cost is increased for a large font and decreased for indent, and a node of an ordinary line of cost 6 is always added. The third rule indicates that, if the OLIST line appears next to the first line of a list (both ordered and unordered) or the continuation line of a list, the node next to the continuation line of the list is also added in addition to the above possibility, assuming that it is the continuation line of the list. In the fourth rule, the ULIST line can appear next to an ordinary line, the first line of a paragraph, the first line of a list, or the continuation line of a list, and, in this step, the cost is increased for a large font and decreased for indent, and unlike for the OLIST line, it is assumed that there is no possibility of being an ordinary line. The fifth to seventh rules indicate how to deal with ordinary lines. The fifth rule indicates that, if an ordinary line appears next to the last line of a paragraph or a header, this line is recognized to be a header and the attribute is changed to HEADER if the indent is smaller than the default one (is to the left) and the font is large; otherwise it is recognized to be the beginning of a paragraph and the attribute is changed to PBEGIN. In the sixth rule, if an ordinary line exists in a paragraph and there is a period at the end of the line, it is deemed to definitely be the end of the paragraph, if it is short, and the node of PEND is added with a cost smaller than ORDINARY; if it is not short, it is considered that it may or may not be end of the paragraph, and the node of PEND is added with the same cost as the node of ORDINARY. Further, the seventh rule indicates that, if an ordinary line appears within a list, it is considered that it may be the continuation line of the list. The grammar interpreter section 418 applies a grammar 420 to the input. Then, with reference to the flowcharts in FIGS. 5 to 8, the process in the grammar interpreter section 418 is described. FIG. 5 shows the general flow of the process in the grammar interpreter section 418. First, in step 502, (.phi., START, 0, 0), which becomes the root node of a graph, is generated. In step 504, the content of the grammar file 420 is read from the hard disk 404. In step 506, data containing <line, attribute, level, cost>, which corresponds to the first line of a text, is read in main memory. What is indicated as a line includes values such as character string (T.sub.n), indent (I.sub.n), line length (L.sub.n), and font size (F.sub.n) shows in FIG. 4 as line data 414, and these are used in step 512. In step 506, the data related to all of the <attributes, levels, costs> of this line are read. "All" is mentioned here because, in accordance with the present invention, a plurality of <attributes, levels, costs> can be given to one line. Then, in step 508, it is determined whether the end of data, or the data corresponding to the last line, has already been read. If so, in step 510, a node becoming the end of the graph is connected by ADD(END, 0, 0, 0), completing the process. If, in step 508, it is determined to not be the end of data, the flow in step 512 proceeds to the "grammar and line data matching" process described in detail in FIG. 6 and, after completion of the process, the flow returns to step 506 where data containing <line, attribute, level, cost> corresponding to the next line of the text is read. Then, referring to FIG. 6, the process referred to as step 512 in FIG. 5 is described. In FIG. 6, 1 is stored in an integer variable n in step 602. In step 604, it is determined whether the n-th rule exists. If it does not exist, it means that a determination has been made as to the applicability of all of the rules, for instance, shown in example 3 above, the process of FIG. 6 is completed. If the n-th rule exists, the flow proceeds to step 606 where it is determined whether the attribute (or attributes) of the current line is equal to the n-th rule P.sub.n. If it is determined that the attribute (or attributes) of the preceding line is equal to the n-th rule P.sub.n, the flow proceeds to step 608 detailed in FIG. 7 and then to step 610. Otherwise, since the n-th rule cannot apply, the flow immediately goes to step 610. In step 610, n is incremented by one, whereby in step 604, the determination is made for the rule next to the rule determined in the previous cycle. Now, referring to FIG. 7, a description is made of the process referred to as step 608 in FIG. 6. In FIG. 7, 1 is stored in an integer variable k in step 702. In step 704, it is determined whether k>n.sub.max, where n.sub.max is the number of Qs of the n-th rule to which attention is currently paid. For instance, in a rule such as:
______________________________________
[Numeric expression 2)
______________________________________
ORDINARY: OLIST
ULIST
LCONT
{
ADD(ORDINARY,?,3,3);
ADD(LCONT,?, 3,2);
}
______________________________________
n.sub.max =3. k>n.sub.max means that all of the Qs of this rule have been examined, and thus the process of FIG. 7 is completed. If k>n.sub.max does not hold, the flow goes to step 706, where it is determined for the preceding line whether a node N whose attribute is Q.sub.nk exists. Q.sub.nk means the k-th Q of the n-th rule. If such a node N is found, the flow proceeds to step 708 where the cost for the attribute (limited to one in step 606 in FIG. 6 though it may be plural) of the current line is stored in a variable called cost, the indent value of the current line is stored in a variable called indent, the line length of the current line is stored in a variable called length, and the font size of the current line is stored in a variable called fontsize. Although there are other variables such as rightmargin, they are omitted for the convenience of the description. These variable values, as seen from the rule.
______________________________________
[Numeric expression 3]
______________________________________
OLIST: ORDINARY
PBEGIN
{
if(fontsize > Defaultsize) cost+;
if(indent > Defaultindent) Cost-;
ADD(OLIST,Level,Cost,2);
ADD(ORDINARY,?,6,2);
}
______________________________________
are used to increase or decrease the cost in {} of the rule. In step 710, the character string of the current line is scanned and, if it is determined that there is a period at the end of it, the logical variable called period is changed to yes (in C language, an integer 1 is substituted because it has originally no logical variable). The variable called period is used in the rule as follows, for instance:
______________________________________
[Numeric expression 4]
______________________________________
ORDINARY: PBEGIN
ORDINARY
ADD(ORDINARY,?,3,3);
if(period)
if(length<Defaultlength)ADD(PEND,?, 2, 2);
else ADD(PEND, ?, 3, 2);
}
______________________________________
In step 712, the cost of the node N is increased or decreased according to the description in {} in the rule. In step 714, a new node is generated in accordance with the ADD sentence in {} in the rule and connected after the node N. The process in step 714 is described in more detail with reference to FIG. 8. After step 714, k is incremented by one in step 716 and the process returns to step 704. Then, referring to FIG. 8, the process shown as step 714 in FIG. 7 is described in more detail. This process shows the process of ADD(class, level, node.sub.-- cost, link.sub.-- cost) which is used in the above described rule. In step 802 of FIG. 8, a new node N.sub.1 is generated. If implemented in C language, this is performed by allocation of the region for the size of the structure of one node from main memory. In step 804, class is substituted for the attribute of the newly generated node N.sub.1, level is substituted for the level and node cost is substituted for the cost. In step 806, a link is formed from the node N determined by the attribute determination in step 706 in FIG. 7 to the just generated node N.sub.1, and the cost of a link.sub.-- cost value is assigned to the link. In addition, the above described values of class, level, node.sub.-- cost, and link.sub.-- cost are provided as the argument for the ADD function of the rule which is currently applied. It is to be noted that, as shown in example 3 described above, each of a plurality of rules always has at least one ADD function, and, thus, at least one new node is generated, whichever rule is applied. After such a process is carried out by the graph creating section 308, graph structure data as shown by reference number 414 in FIG. 4 is created, and, in this embodiment, it is temporarily stored on the hard disk 404 as shown in FIG. 4, though it may be only stored in main memory. To show a more specific example, a graph such as that in FIG. 9 is created from the above sample sentence. The possibility of (3), (5), (7), (8) being a header is excluded at this stage (because HEADER never follows PGEGIN in the above grammar). In addition, to which list item the continuation line of list is directed is indefinite at this stage, and this is solved in the next graph search stage. The present invention considers that the graph 424 represents candidates for logical structure interpretation. That is, each of the respective paths from the start to the end represents one logical structure interpretation. In this case, since the cost is independently assigned to each node and link by the processes described in FIGS. 5 to 8, the most appropriate logical structure interpretation can be obtained if a search is made for the path having the smallest cost among the respective paths. However, in the graph 424, creation and cost assignment are performed only by the connection relationships before and after the particular node corresponding to the current line. For instance, assuming that the current line is a header (the beginning of a list item), the connection relationship with the preceding header (the beginning of a list item) is not considered. For instance, line (3) of example 3 was assigned the attribute label corresponding to the beginning of a list item of level 2 in the previous stages (text format analyzing section and graph creation section). However, since there has been no list item in the same paragraph until that time, the level should be 1. In addition, line (5) is labeled OLIST, ORDINARY, and LCONT because of its format and grammar. However, from comparison with the beginning ((3)) of the preceding list item, the labels LCONT (this means that (5) is part of the list item beginning with (3)) and ORDINARY (this means that (5) is another line, that is, the list beginning with (3) ends only in (3), (4) and the next list item belongs to another list) can be excluded. (This is also applied to (8).) Further, since (7) has an indent which is to the right of (3) and (5), the possibility of being the beginning of a list item of the same level is low, and the possibility of being ORDINARY or LCONT is high. (The possibility of the list being nested and (7) being a header of level 2 is low because the number does not start with 1.) From indent, it is appropriate that it is determined to be the continuation line of a list item starting from (5). Furthermore, in the graph 424, the level of the continuation line of a list is undefined, and it is determined from comparison with the start line of the preceding list item for indent. Example 4 is shown as heuristics for performing these cost adjustments. A cost adjustment is carried out for the cost for a link. The reason for this is that, if there are different paths even for the same node (this means that the preceding and subsequent lines are different), there may be a case in which the cost should be made different. For this, in the analysis graph search section 310 (FIG. 4), the attribute value associated with each label (in this example, level) is held at a node and propagated to the subsequent nodes in the search stage to perform a pruning or reevaluation of the cost. By this, the description of the logical structure of a document is not limited to the range of the regular grammar, but can be equivalent to context-free grammar or higher descriptive power. The input to the graph search section 310 is the data of the file 424 of a graph stored on the hard disk 404 and expressed by a predetermined data structure, and the output thereof is a prioritized graph 426. That is, on the prioritized graph 426, data representing a path starting from the start and reaching the end is ordered and expressed in an ascending order of the sum of the costs associated with the nodes land links on the path. Each path data contained in such a prioritized graph 426 can easily be converted to a document with tags such as BookMaster or Tex by inputting it to an appropriate text processing program. The search in the graph search section 310 is performed not only for the optimum one but also for all of those whose cost difference between the optimum path is within a predetermined value or those within the N-th order (N is a predetermined integer larger than 1). This allows one or more interpretations to be temporarily output for the part for which ambiguity still remains. As one way which enables such search, for instance, there is an approach which Itoh and Maruyama use in morphological analysis and the postprocessing of OCR (Itoh, Maruyama: Error detection and automatic correction of OCR-input Japanese sentences, Proc. of Information Processing Society, Vol. 33, No. 5, pp. 664-670, 1992). Also, a similar description is found in the Patent Application LaidOpen No. 5-46590 official gazette. These are characterized by combination of the graph search method of Dijkstra and beam search. As a result of the above described analysis graph creation and path search, document logical structures can be obtained in the descending order of their plausibility. The prioritized graph created from the graph of FIG. 9 is shown in FIG. 10. This shows that by applying heuristics existing within the graph search section 310 and as shown in the example 4 below, the cost assigned to the links is modified, and the modified costs are calculated for each path of the graph and exhibited from lower ones to the second candidate. EXAMPLE 4 Heuristics for Cost Modification For a node having the attribute level of HEADER on the particular path, if there is no HEADER of the same format before it, the cost is increased if the number is not 1. The level is set to 1 regardless of the level written in the dictionary. If its format (written in the dictionary) and level are the same as the preceding HEADER and the number (such as "1" of Chapter 1) continues, the level is made equal to the preceding level. In this case, the cost is not modified. If the level is the same and the number does not continue, it is made equal to the preceding level, but the cost is increased. If the level written in the dictionary is higher than the preceding HEADER, the cost is increased if the number is not 1. If the level is lower than the preceding HEADER, the path is in traced reverse to search for a HEADER node of the same format and same level. If there is no such node, the cost is increased. If there is, whether the number continues is checked and, if not, the cost is increased. For a node having an attribute label of OLIST on the particular path, if the same paragraph does not contain OLIST AND ULIST before the particular node, the cost is increased if the number is not 1. The level is set to 1 regardless of the level written in the dictionary. If the format (written in the dictionary) and level are the same as the preceding OLIST and the number continues, the level is set to the same as the preceding level. In this case, if the indent is the same as the preceding OLIST, the cost is not changed. If the level is the same and the number continues or the indent is not in harmony, then the level is set to the same as the preceding level, but the cost is increased. If the level written in the dictionary is higher than the preceding OLIST, the cost is increased if the number is not 1. Further, at this point, if the indent is more to the left than the preceding OLIST, the cost is increased. If the level is lower than the preceding OLIST, the path is reversely traced to search a OLIST node of the same level. If there is no such node, the cost is increased. If there is, whether the number continues is examined and, if not, the cost is increased. Further, at this point, if the indent is not the same as OLIST of the same level, the cost is increased. If the same paragraph contains no OLIST of the same format before the particular node, but there is a different OLIST or ULIST, if the indent is more to the right than the preceding OLIST or ULIST, the level is assumed to be deeper and set to the level of the preceding OLIST or ULIST plus one. At this point, if the number is not 1, the cost is increased. If the indent is more to the right than the preceding OLIST or ULIST, the path is traced in reverse to search for a OLIST node of the same level. If there is no such node, the cost is increased. If there is, whether the number continues is checked and, if not, the cost is increased. Further, at this point, if the indent is not the same as the found OLIST of the same level, the cost is increased. For a node having an attribute label of ULIST on the particular path, If the same paragraph contains no ULIST or OLIST before the particular node, the level is set to 1. If there is only OLIST, the level is set to the level of the preceding OLIST plus one. At this point, if the indent is more to the left than that OLIST, the cost is increased. If there is ULIST and the format is the same as that, the level is made to equal to the preceding level. If the format is different from the preceding ULIST, the path is traced in reverse to search for a ULIST node of the same format. If there is no such node, the cost is increased if the indent is more to the left than the preceding ULIST. The level is set to the level of the preceding ULIST or OLIST plus one. If there is such a node, the cost is increased if the indent is not the same as the found ULIST of the same format. For a node having an attribute of ORDINARY on the particular path, if the preceding node is OLIST, ULIST, or LCONT, the cost is increased if the indent is more to the right than or the same as that line. For a node having an attribute label of LCONT on the particular path, if the attribute of the preceding node is LCONT and the indent is the same, the level is made equal to that node. Otherwise, the path is traced in reverse to search for a OLIST or ULIST node of the same paragraph. The level is made to equal to OLIST or ULIST nearest to the particular node the indent of which is the same as or more to the left than the particular node. If there is no such OLIST or ULIST (that is, the indent is more to the left than any OLIST or ULIST of the same paragraph), the cost is increased, with the level being set to 1. In addition, in accordance with the heuristics of the above example 4, (12") is assumed to be the same level as (11"). Further, to tag the path obtained in FIG. 10 is easy, and, basically, tagging can be made only by the attribute and level of a node. For instance, if a node having the attribute of HEADER is encountered along a selected path, a tag corresponding to the header is attached to the beginning of the line associated with that node, which is output to the hard disk 404. Then, if a node having an attribute of OLIST is encountered, a tag indicating an ordered list is attached to it, and the line associated with the node is output. Then, if a node having the attribute of LCONT is encountered, a tag indicating continuation of an ordered list is attached to it, and the line associated with the node is output. Further, if a node having an attribute of ORDINARY is encountered, a tag indicating the end of an ordered list is output and a carriage return is performed, and then the line associated with the node is output. In addition, the level is required when a return is made from a line of the attribute of list to a line of the ordinary attribute and the like. For instance, a return may be made directly from a list of level 2 to a sentence of an ordinary line and, in this case, if level 2, it is needed to output two tags indicating the end of a list. In this way, when the final node is reached, tagging to all of the sentences is completed. The present invention produces the following advantages: 1. Not only the logical structure of a document can be described by a grammar with attribute values, but also the characteristics such as whether an indent is large or small, and differences in front size, which are difficult to describe with the framework of the grammar, can be comprehensively reflected by increasing or decreasing the cost associated with the particular transition. 2. By interpretation of the logical structure by a graph search problem, and enabling the search up to N-th, more than one candidate can be output in the order of the plausibility of their interpretation. 3. To detect a portion including ambiguity is more easy, and thus it is also possible to warn the user. Accordingly, the present invention satisfies the requirements (A) to (C) described in "Means for solving the problems," and has proved that it runs at a practically sufficient speed (in the order of 100 lines/s on a personal computer with a CPU of i486 (trademark of Intel) of 33 MHz). Examples of actual input and output are shown in FIGS. 11 and 12, respectively. However, to express a logical structure, tagging is performed using the system of IBM BookMaster (User's Guide for IBM BookMaster, N:SC34-5009, (1989)). Further, an analysis of two chapters of manuals and ten articles shows that 86% of the tagging in the manuals was correct and 82% of the tagging was correct in the articles for the first candidate. If the second candidate is also allowed, the proportion of correct tags increases to 89%. This exhibits the advantage that a plurality of solutions can be presented.
|
Same subclass Same class Consider this |
||||||||||
