Method and apparatus for proofreading a document using a computer system which detects inconsistencies in style
6125377
Abstract
A computer system for proofreading a document in electronic form performs the steps of: identifying elements of the document; interpreting elements of the document; creating known element objects; linking related known elements; and comparing known elements and linked objects to identify inconsistencies in the document. Further, the computer system for proofreading a document has an output device for supplying an identification of the identified inconsistencies.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
______________________________________
F = Bolding (2)
Underlining (2)
Italicization (1)
C = All Capitals (10)
Initial Capitals (4)
J = Centered (1)
______________________________________
IF: F + C + J ≥ 7 THEN: TITLE
ELSE: IF: F + C + J ≤ 3 THEN: END
ELSE: ASK USER
In the context of the aforementioned example, the following would result:
Simple Rules Base:
If: P = Page 1
L_text < 6 words AND
F = Bolding OR Underlining OR Italicization AND
C = All Capitals OR
J = Centered
THEN: Text = Title
As the word "Document" lacks the necessary Capitalization attribute and Justification attribute, it would not be considered a title under the Simple Rules Base. Interpreted by the more complex Weighted Rules Base, the following results:
Weighted Rules Base
Tier 1
Since: P = Page 1
L_text < 6 words
THEN: Go To Tier 2
Tier 2
Since
______________________________________
F = Bolding = 2
Underlining = 2
C = Initial Capitals = 4
______________________________________
Since F + C + J = 8 > 7 THEN: TITLE
Otherwise stated, as F + C + J = 8, the text is interpreted as a Title.
Whether weighted or not, if this rules-based analysis suffices to determine the nature of this text, the nature of the Unknown Element is determined. The attributes associated with the now-interpreted Known Element are accumulated for the purpose of establishing standard styles per type of Element. At this point, the third component step, the Creation of Known Elements, is commenced.
3. Heuristic-Based Interpretation
Where the nature of a unit of text is ambiguous in that the text complies only in part with the rules, or in that the attributes fail to indicate which rules group to apply, the rules-based analysis may not suffice to enable the determination of the nature of a piece of text. In this case, heuristic processing is applied. Heuristic processing involves the interpretation of the text under analysis in the context of the Global Style and other significant styles determined in the fourth step, following, rather than the Interpretation Rules Base. The interpretation of text in light of the significant styles and Global Style extracted for each Known Element enhances the efficiency of the invention and its ability to resolve ambiguity without resort to further user input. As noted below, significant styles and a Global Style may be learned not only on a per-document basis, but also on an overall user basis. Styles used by the user in other documents may form the basis for interpreting ambiguous information in the document currently under analysis.
In the absence of heuristic processing, the invention would only be able to ensure accurate interpretation of this ambiguously-attributed unit of text by requesting further user input. Heuristic processing enables the invention to interpret certain ambiguous text without further resort to the user in several situations.
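The hand-off from the Weighted Rules Base to heuristic processing can be illustrated with a minimal Python sketch. The weights and the two thresholds (7 and 3) come from the tables and rules above. The function name, the attribute keys, the `Verdict` type and the handling of text that fails Tier 1 are assumptions made for illustration and are not part of the specification.

```python
from enum import Enum

# Weights as listed in the text: F (format), C (capitalization), J (justification).
FORMAT_WEIGHTS = {"bolding": 2, "underlining": 2, "italicization": 1}
CAPS_WEIGHTS = {"all_capitals": 10, "initial_capitals": 4}
JUSTIFY_WEIGHTS = {"centered": 1}
ALL_WEIGHTS = {**FORMAT_WEIGHTS, **CAPS_WEIGHTS, **JUSTIFY_WEIGHTS}


class Verdict(Enum):
    TITLE = "title"
    NOT_TITLE = "not_title"  # the "END" branch of the rule
    AMBIGUOUS = "ambiguous"  # the "ASK USER" branch; heuristics may resolve it first


def weighted_title_verdict(page, word_count, attributes):
    """Apply Tier 1 (page and length), then Tier 2 (the weighted sum F + C + J)."""
    # Tier 1: P = Page 1 and L_text < 6 words.
    # Assumption: text failing Tier 1 is simply treated as not a title.
    if page != 1 or word_count >= 6:
        return Verdict.NOT_TITLE
    # Tier 2: sum the weights of every attribute that is present.
    score = sum(ALL_WEIGHTS.get(name, 0) for name in attributes)
    if score >= 7:
        return Verdict.TITLE
    if score <= 3:
        return Verdict.NOT_TITLE
    return Verdict.AMBIGUOUS


# The worked example: bolding (2) + underlining (2) + initial capitals (4) = 8.
print(weighted_title_verdict(1, 1, {"bolding", "underlining", "initial_capitals"}))
```

A score strictly between the two thresholds is the case in which a rules base alone cannot decide, which is where heuristic processing or user input takes over.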
The significance of heuristic processing is made clear by the following discussion of two examples. The first example below relates to a case where the text complies only in part with the rules. The second example below relates to a case where the text's attributes fail to indicate which rules group to apply.
The first example presents a situation similar to the above, where the document title, however, is ambiguous as it possesses only the following format attributes:
Title Document
According to either the Simple Rules Base or the Weighted Rules Base, each described above, this word would not be interpreted as a title. Assume that the Significant Style Counter described in the fourth step below extracted the user significant style described by the following rule:
Heuristic
If: P = Page 1
L_text < 6 words AND
C = Initial Capitals AND
Other Sub-Titles are found to have none of: Bolding, Underlining, Italicizing
THEN: Text = Title
Assume the following interpretation by the heuristic:
P = Page 1
L_text < 6 words
C = Initial Capitals
Other Sub-Titles are found to have no Bolding, Underlining or Italicizing.
As a result, this text is interpreted to be a title.
The second case, in which the attributes fail to indicate which rules group to apply, is illustrated by the following example from the sample document provided earlier in this application:
Document
1. Sub-Title 1. The text reads as follows: (a) this is the first section, (b) this is the second section including the following sub-sections (i) the first sub-section, (ii) the second sub-section and (iii) the third sub-section, and (c) this is the third sub-section.
2. Sub-Title 2. This text is ambiguously numbered (a) the first section and (b) the second section (i) a section with an ambiguous relationship to the preceding numbering series and (ii) a second entry in that numbering series.
3. Sub-Title 3. This text is numbered (a) the first section and (b) the second section (i) a new, unseen numbering style, (ii) second in that series and (iii) third in that series, and (c) this is the third sub-section, including a cross-reference, as stated in Section 1(a).
Following a review of these 3 numbered paragraphs, five things become clear:
1. There are 2 styles of numbering: the first at the beginning of the paragraph and the second embedded within the paragraph ("Embedded Numeration").
2. There are several possible instances of Embedded Numeration series within any paragraph.
3. The Embedded Numeration may be either embedded in increasingly lower levels or embedded in parallel to other instances of numeration.
4. Paragraphs 1 and 3 are consistent in their numbering styles, in that both use two Embedded Numeration styles: (a, b, c) (i, ii, iii), and further in that the second numbering style (i, ii, iii) is embedded in a lower level than the first numbering style (a, b, c).
5. Paragraph 2 is ambiguous in its numbering style, in that the first Embedded Numeration style (a, b) does not clearly embed the second Embedded Numeration style (i, ii). It is therefore not clear whether sections (i, ii) are embedded within section (b) or whether they are parallel to numbering series (a, b).
Assume that, through the Significant Style Counter, the invention learns the following user Global Style, including the learned relationship among different series of numeration:
______________________________________
Global Style:
First Embedded Numeration Style: a, b, c
Second Embedded Numeration Style: i, ii, iii
AND
Second Embedded Numeration is further embedded in First
Embedded Numeration
______________________________________
In the absence of this learned Global Style and relationship, the Interpretation Rules Base is unable to deduce the relationship between the two series of Embedded Numeration and can only ensure accuracy of interpretation by requesting further user input. However, heuristic processing expands the efficiency of the invention, as it enables the invention to make determinations without further user input. In particular, as above, where the numbering styles have been used unambiguously in other locations in the document, the invention may apply the relationship learned from these other locations to the current ambiguous location. In the sample paragraph above, even though the numbering of paragraph 2 is ambiguous when considered on its own, the learned Global Style helps resolve the ambiguity.
To further illustrate the power and efficiency gains enabled by this heuristic processing, consider the following sample paragraph once again:
2. Sub-Title 2. This text is ambiguously numbered (a) the first section and (b) the second section (i) a section with an ambiguous relationship to the preceding numbering series and (ii) a second entry in that numbering series.
Further, assume that, through the Significant Style Counter, the invention learns the following user Global Style, immediately below, and the Additional Significant Style, further below, including the learned relationships among different series of numeration:
______________________________________
Global Style:
First Embedded Numeration Style: a, b, c
Second Embedded Numeration Style: x, y, z
AND
Second Embedded Numeration is further embedded in First
Embedded Numeration
______________________________________
In the absence of these learned styles, the Interpretation Rules Base is unable to deduce the relationship between the two series of Embedded Numeration and can only ensure accuracy of interpretation by requesting further user input. While heuristic processing expands the efficiency of the invention, as it enables the invention to make certain determinations without further user input, it does not assist in the determination of the sample paragraph above, as the Global Style differs from that used in this paragraph. Absent any additional heuristic processing, the invention, again, can only ensure accuracy of interpretation by requesting further user input.
Assume, however, that in addition to the Global Style noted above, the Significant Style Counter learns that the user has an additional significant style, though it appears less frequently than the Global Style. The learned Additional Significant Style, which follows, further allows the invention to make accurate determinations without requiring user input.
______________________________________
Additional Significant Style:
First Embedded Numeration Style: a, b, c
Second Embedded Numeration Style: i, ii, iii
AND
Second Embedded Numeration is further embedded in First
Embedded Numeration
______________________________________
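The fallback from the Global Style to an Additional Significant Style can be sketched in Python as follows. This is only an illustration under stated assumptions: the `NumerationStyle` record, the `resolve_relation` function and the string labels for each numbering series are hypothetical names, and the specification does not say how learned styles are stored or matched.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NumerationStyle:
    """A learned pairing of two Embedded Numeration series (hypothetical record)."""
    first: str                 # e.g. "a, b, c"
    second: str                # e.g. "i, ii, iii"
    relation: str = "nested"   # the second series is embedded in the first


def resolve_relation(
    first: str,
    second: str,
    global_style: NumerationStyle,
    additional_styles: Sequence[NumerationStyle] = (),
) -> Optional[str]:
    """Try the Global Style first, then each Additional Significant Style.

    Returns the learned relation, or None when no learned style matches,
    in which case the user would have to be asked.
    """
    for style in (global_style, *additional_styles):
        if style.first == first and style.second == second:
            return style.relation
    return None


# The second scenario in the text: the Global Style uses (x, y, z), so it cannot
# explain paragraph 2, but the Additional Significant Style uses (i, ii, iii).
global_style = NumerationStyle("a, b, c", "x, y, z")
additional = [NumerationStyle("a, b, c", "i, ii, iii")]
print(resolve_relation("a, b, c", "i, ii, iii", global_style, additional))  # nested
print(resolve_relation("a, b, c", "i, ii, iii", global_style))              # None
```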
As noted above, paragraph 2 is ambiguously numbered. Neither the Interpretation Rules Base nor the Global Style heuristic processing is sufficient to resolve this ambiguity. However, the learned Additional Significant Style further expands the efficiency of the invention, as it enables the invention to make determinations without further user input. In particular, as above, where the numbering style has been used both ambiguously and inconsistently with other locations in the document, relationships learned by the Significant Style Counter may help determine the interpretation of ambiguous information.
4. User-Guided Interpretation
If this heuristic processing suffices to determine the nature of this text, the nature of the Unknown Element is determined and the third component step, the Creation of Known Elements, is commenced. Where neither the rules base nor heuristics suffice to guide the invention to determine the nature of the text under analysis, it requests additional input from the user to enable it to make a determination as to the nature of the Unknown Element. Upon the user's enabling input of the nature of the text, the nature of the Unknown Element is determined and the third component step, the Creation of Known Elements, is commenced.
D. Document De-Construction/Known Element Object Construction
The third step involves the Creation of Known Element objects. Following the aforementioned determination of the nature of the Unknown Element, the third step, in which the Unknown Element is converted into one of a variety of known types of objects, termed a Known Element, takes place. The types of Known Elements into which the Unknown Elements are converted are detailed below.
Essentially, the result of this third step is the analysis and deconstruction of the document into its structural components, detailed below. This analysis enables the invention to take its next step, the analysis of the distinctive user styles and formats associated with each of these components and the abstraction and establishment of the document's overriding styles and formats. These styles and formats are determined from the attributes gathered in the Unknown Element Parsing for each Known Element.
1. Structural Component Elements
Consider again the example of the title interpreted through the simple Interpretation Rules Base, in the second step above:
"Document"
The attributes associated with the Known Element, Document Title, are therefore as follows:
P = Page 1
L_text < 4 words
F = Bolding, Underlining
C = Initial Capitals
J = Centered
The structural components into which the document is broken down include the following overall document parts:
Title Page;
Table of Contents and other document Indexes; and
the Body of the Document.
The Body of the Document is further deconstructed into the following sub-components of the document:
Document Title;
Sections and Sub-Sections, whether numbered, bulleted or plain text;
The relationship among Document Tiers, including Nested Sections and Sub-Sections;
Section Sub-Titles;
Numeration and Bulleting; and
Structural Punctuation and Conjunctions.
To further illustrate this deconstruction, consider the sample paragraph 2 as above, including the following Global Style:
Document
2. Sub-Title 2. This text is ambiguously numbered (a) the first section and (b) the second section (i) a section with an ambiguous relationship to the preceding numbering series and (ii) a second entry in that numbering series.
______________________________________
Global Style:
First Embedded Numeration Style: a, b, c
Second Embedded Numeration Style: i, ii, iii
AND
Second Embedded Numeration is further embedded in First
Embedded Numeration
______________________________________
The following Known Elements characterize paragraph 2. Known Element No. 1:
______________________________________
Sections: <2. Sub-Title 2. This text is ambiguously numbered
. . . that numbering series.>
Section Attributes:
Sub-Titled.
Spacing: 0
Justification: Left
Length: 39 Words
Global Style:
First Embedded Numeration Style: a, b, c
Second Embedded Numeration Style: i, ii, iii
AND
Second Embedded Numeration is further embedded
in First Embedded Numeration
Numbering Style:
<Arabic Numeral>
Structural Punctuation: <.>
______________________________________
Known Element No. 2:
______________________________________
Section Sub-Title:
Sub-Title 2.
Sub-Title Attributes:
F = Underlining
C = Initial Capitals
Structural Punctuation: <.>
Structural Conjunction: None.
______________________________________
Known Elements No. 3 and 4:
______________________________________
Sub-Sections: <Sub-Section 1: (a) the first section and>
<Sub-Section 2: (b) the second section (i) a
section . . . that numbering series.>
Sub-Section Attributes:
No Sub-Title.
Numbering Style: <a, b, c>
Structural Punctuation: None.
Structural Conjunction: <and>
______________________________________
Known Elements No. 5 and 6:
______________________________________
Sub-Sub-Sections:
<Sub-Sub-Section 1: (i) a section with an
ambiguous . . . series and>
<Sub-Sub-Section 2: (ii) a second entry in
that numbering series. >
Sub-Sub-Section Attributes:
No Sub-Titles
Numbering Style: <i, ii, iii>
Structural Punctuation: None.
Structural Conjunction: <and>
______________________________________
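The deconstruction of paragraph 2 into Known Elements can be pictured as a small tree of objects. The following Python sketch is one possible representation, assuming a hypothetical `KnownElement` record with free-form attribute names; the specification does not prescribe a data layout. The nesting of (i) and (ii) under (b) follows the Global Style given above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KnownElement:
    """One interpreted unit of the document (hypothetical layout)."""
    kind: str                                   # e.g. "Section", "Sub-Section"
    text: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["KnownElement"] = field(default_factory=list)


def walk(element, depth=0):
    """Yield (depth, kind) pairs in document order."""
    yield depth, element.kind
    for child in element.children:
        yield from walk(child, depth + 1)


sub_sub_attrs = {"numbering_style": "i, ii, iii", "structural_conjunction": "and"}
sub_attrs = {"numbering_style": "a, b, c", "structural_conjunction": "and"}

section_b = KnownElement(
    "Sub-Section", "(b) the second section", sub_attrs,
    children=[
        KnownElement("Sub-Sub-Section", "(i) a section with an ambiguous ...", sub_sub_attrs),
        KnownElement("Sub-Sub-Section", "(ii) a second entry in that numbering series.", sub_sub_attrs),
    ],
)
section_2 = KnownElement(
    "Section", "2. Sub-Title 2. This text is ambiguously numbered ...",
    {"numbering_style": "Arabic Numeral", "structural_punctuation": ".", "justification": "Left"},
    children=[
        KnownElement("Section Sub-Title", "Sub-Title 2.", {"F": "Underlining", "C": "Initial Capitals"}),
        KnownElement("Sub-Section", "(a) the first section and", sub_attrs),
        section_b,
    ],
)

for depth, kind in walk(section_2):
    print("  " * depth + kind)
```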
2. Functional Component Elements
Certain functional components of the document are also identified and extracted through the application of a further rules base, the "Functional Component Rules Base." These include acronyms and defined terms, along with their associated definitions. In addition, the invention extracts other functional components that serve to inter-link structural components of the document, including cross-references to Sections and Sub-sections, as well as to other documents. Consider the sample cross-reference included in the sample Section 3, above:
3. Sub-Title 3. This text is numbered (a) the first section and (b) the second section (i) a new, unseen numbering style, (ii) second in that series and (iii) third in that series, and (c) this is the third sub-section, including a cross-reference, as stated in Section 1(a). (italics inserted)
The italicized phrase has significance apart from its simple lexical significance, as it functions to inter-link two portions of the document. As a result of this inter-linkage, Section 3 now has an expanded semantic meaning; it now includes not only the terms expressly included in Section 3, but also the terms of Section 1(a), incorporated by reference. This significance, termed "functional significance," makes the italicized phrase an additional, important structural component of the document.
E. Learning User Styles
The fourth step involves the establishment of standard styles. This establishment is accomplished through the Significant Style Counter, which increments a counter for each type of Known Element in order to determine the standard attributes associated with it. As several different sets of attributes may be used throughout a document in association with any one type of Known Element, several parallel counts may be kept for each type of Known Element.
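The parallel counts kept by the Significant Style Counter can be sketched with Python's `collections.Counter`. This is a minimal illustration, not the patented implementation: the class and method names are hypothetical, the two thresholds correspond to the significance threshold and the Global Style minimum introduced in this section, and choosing the most frequent candidate when several exceed the minimum is an assumption.

```python
from collections import Counter


class SignificantStyleCounter:
    """Counts attribute sets per (element type, Document Tier). Hypothetical sketch."""

    def __init__(self, o_min, n_min):
        self.o_min = o_min  # threshold for a style to count as significant
        self.n_min = n_min  # minimum for a candidate to become the Global Style
        self.counts = Counter()

    def observe(self, element_type, tier, style):
        # A frozenset of items makes the attribute dict hashable.
        self.counts[(element_type, tier, frozenset(style.items()))] += 1

    def _matching(self, element_type, tier):
        # most_common() yields entries from the highest count to the lowest.
        for (etype, etier, style), n in self.counts.most_common():
            if etype == element_type and etier == tier:
                yield dict(style), n

    def significant_styles(self, element_type, tier):
        """Every style whose count is higher than the significance threshold."""
        return [s for s, n in self._matching(element_type, tier) if n > self.o_min]

    def global_style(self, element_type, tier):
        """Most frequent style exceeding n_min, or None (tie-breaking is an assumption)."""
        for style, n in self._matching(element_type, tier):
            if n > self.n_min:
                return style
        return None


counter = SignificantStyleCounter(o_min=2, n_min=4)
underlined = {"F": "Underlining", "C": "Initial Capitals"}
bold = {"F": "Bolding", "C": "Initial Capitals"}
for _ in range(5):
    counter.observe("Section Sub-Title", 1, underlined)
for _ in range(3):
    counter.observe("Section Sub-Title", 1, bold)
print(counter.global_style("Section Sub-Title", 1))
```

In this run the underlined style becomes the Global Style, while the bold style remains an Additional Significant Style that can still be used to interpret ambiguous text.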
Further, as each type of Known Element may exist in different Document Tiers, the varying style used for each type of Known Element in each different Document Tier is further tracked separately. Additionally, distinct counts may be kept for Known Elements appearing in different positions or sub-components in a document. In any such case, where this incrementation results in a count higher than a threshold for a significant style, a candidate for standard style is generated for the Known Element in the particular Document Tier and document structural component in which the Known Element appears.
These steps are oriented not to finding definitive Element-wide styles (i.e., for the particular Document Tier and position), but rather only candidate significant styles. These candidates are significant in two ways. First, they are deserving of further analysis to determine whether they are the over-riding Element-wide style, termed the "Global Style," for the particular Known Element. Second, whether or not they are the Global Style, each candidate style is, by definition, a significant user style, evidenced by its exceeding the threshold number of occurrences for significance, O_min; as a result, any of these significant styles may have been used by the user on elements of the same type and can be used to resolve ambiguities in Unknown Element Interpretation, as above.
The candidates for "Global Style" must be further analyzed to determine which of these styles qualifies as the "Global Style" to which all other numbering styles must ultimately conform. Where the number of occurrences of significant style x, N_x, exceeds a minimum N_min, that candidate style is designated the "Global Style." The remainder of the styles may be designated Additional Significant Styles.
F. Linkage
The fifth step involves the linkage of related Known Elements.
A further rules base, the "Linkage Rules Base," instructs the invention which Known Elements to link and where to link them. Linkages are generally inserted in order to account for three types of relationship.
The first, "parent-child," exists where one element is the source element referenced by a referencing functional component. Linkage of the referencing element to the source element concretizes the implicitly referencing relationship among the two elements; an electronic link is inserted to reflect the conceptual link indicated by the reference. "Parent-Child" links include the links of a defined term to its definition; a section reference to its referenced section; a cross-document reference to the other document.
The second, "sibling-sibling," exists where two elements are of the same type. In this case, linkage reflects not a relationship of referencing, but rather a relationship of commonality of certain attributes. Two Known Elements of the same type share certain significant attributes and the linkage reflects that. "Sibling-sibling" links include the links of one section to another; of one sub-title to another; of one defined term to another; of one cross-reference to another; etc.
The third, "twin," is a sub-set of the "sibling-sibling" relationship. It exists where two elements are not only of the same type, but include the identical text. In this case, not only certain attributes, but rather all attributes are shared. "Twin" links include the links of one instance of a defined term to another instance of the same defined term; of one cross-reference to a section to a second cross-reference to the same section; etc.
In any of these cases, the insertion of these links allows several things, including comparison of Known Elements to one another, as in the sixth step below, as well as movement directly among linked Known Elements.
G. Proofreading
The sixth step is Comparison of Known Elements, a further rules based process.
These steps, based on the "Proofreading Rules Base," guide the invention as to where and in what ways to compare Known Elements. Essentially, this Rules Base examines Known Elements for several things.
First, the Proofreading Rules Base seeks the presence or absence of linkages among "parent-child" Elements; all child elements must be linked to their parent elements. In addition, certain parent elements must be linked to a child element, e.g. a definition must be linked to a usage of a defined term, while other parent elements may validly exist without linkage to a child element, e.g. a section need not be referenced to be valid. If these links are absent, an error exists.
Second, the Proofreading Rules Base seeks the presence of common attributes among both "sibling-sibling" elements and "twin" elements. As their relationship is defined by the commonality of certain attributes, the absence of some or all of these common features is an error.
Third, the Proofreading Rules Base seeks duplication or substantive inconsistency among "twin" parent elements. Where these inconsistencies exist, an error may exist, though the duplication or inconsistency may be intentional. To resolve these inconsistencies, the invention may make its own, rules-based determination or may request user input.
Upon the completion of this process, the invention has deconstructed the document into its structural components, has located any functional components that conceptually link different elements of the document, and has linked all related elements. In addition, it has proofread the document, noting inconsistencies among linked elements, enabling any inconsistencies to be remedied either by application of a further rules base or by user instruction.
This information may be used to enable automatic replacement of inconsistent styles with consistent styles, to prevent the insertion of substantively inconsistent components, to enable conforming of changes between Known Elements of a given type, and to enable presentation of indexes of Known Elements to the user. Each of these uses of the Known Element information relies on further "rules bases" to instruct the invention when and how to replace or prevent insertion of inconsistent styles, when to conform changes and how to organize the information for presentation to the user, respectively.
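Two of the proofreading checks described in sections F and G can be sketched in Python: a "parent-child" check that every section cross-reference resolves to an existing section, and a "sibling-sibling" check that elements of one type share the attributes of the Global Style. The regular expression, the function names and the label format such as "1(a)" are assumptions chosen for illustration; the specification leaves these details to its rules bases.

```python
import re

# Assumed cross-reference form: "Section 1" or "Section 1(a)".
XREF_RE = re.compile(r"Section (\d+)(?:\(([a-z])\))?")


def find_dangling_references(section_labels, text):
    """Parent-child check: return referenced labels with no matching section."""
    dangling = []
    for match in XREF_RE.finditer(text):
        number, letter = match.groups()
        label = number + (f"({letter})" if letter else "")
        if label not in section_labels:
            dangling.append(label)
    return dangling


def find_style_inconsistencies(siblings, global_style):
    """Sibling-sibling check: return (name, attribute, found, expected) tuples."""
    problems = []
    for name, attributes in siblings:
        for key, expected in global_style.items():
            found = attributes.get(key)
            if found != expected:
                problems.append((name, key, found, expected))
    return problems


labels = {"1", "1(a)", "1(b)", "2", "3"}
text = "as stated in Section 1(a) and as further limited by Section 4(b)"
print(find_dangling_references(labels, text))

sub_titles = [
    ("Sub-Title 1", {"F": "Underlining", "C": "Initial Capitals"}),
    ("Sub-Title 2", {"F": "Bolding", "C": "Initial Capitals"}),
]
global_style = {"F": "Underlining", "C": "Initial Capitals"}
print(find_style_inconsistencies(sub_titles, global_style))
```

Each reported item corresponds to an inconsistency that could then be resolved by a further rules base or referred to the user.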
