Corporate disclosure and repository system utilizing inference synthesis as applied to a database

6374270

Abstract

A corporate disclosure and repository system includes one or more software programs which execute on one or more general purpose data processing systems. The software programs include components for gathering information in the form of free form text documents, reducing the information to a formatted database, analyzing the contents of the database and reorganizing the database in a format suitable for drawing inferences with respect to the contents thereof, and synthesizing inferences based upon the contents of the reorganized database. The software programs may be used both intracompany, in preparing documents for deposit in the repository system, and intercompany, in reviewing documents already deposited in the repository system. The intercompany part may be further divided into parts useful to regulators and parts useful to the public. The principal difference between the various parts is in certain knowledge and rules applied in the analysis and reorganization stages, since the various users of the system have different goals at those stages.

Claims

What is claimed is:

Description

BACKGROUND
TABLE 1
Feature                         Explanation
Text Search                     Searching by keywords and Boolean operations
Case Search                     Searching based on related relevant similar cases (free text, plus user preference)
Template Based                  Allows for formatting free text (for quality check, analysis and automatic inferencing)
Discover Based                  Allows for nonformatted open ended questions that also use data mining and inferencing techniques to dig up the answer (based on the special matrix based organization of the data)
Browser                         Allows the user to view the data in a browser, e.g. as HTML
Version Control                 Used for keeping track of the versions of the document
Phase Control                   Used for keeping track and issuing alerts according to the specified process for ready versions (e.g. draft by a certain time to be sent to specified individuals etc.)
Security and Access Control     Allows access to authorized people (also distinguishes between read and write access)
Signature                       Allows for electronic signature
Approval and Release Control    Supports the process of preparing the document
Linked Module RULES             Specific business, consolidation and filing rules (customizable) for preparing the report
Linked Module HELPS             Specific legal (e.g., FASB), accounting and explanatory help
Linked Module PROCESS           Specific process and phases definition for preparing the disclosure
Linked Module UPDATES           Regulatory and disclosure requirements updates
Reports Generator               Generating predefined and customizable reports and documents (e.g. for different types of companies)
Multi Language                  The application supports multi language capabilities
Repurposing Application         Allows for application generation in addition to the disclosure documents
Related Companies               For a company with affiliates and related companies that share data
External Input and Interfaces   Provide needed data for the documents (such as from Financial Portfolio Management Systems, and legacy systems, such as accounting)
ODBC                            Supports connectivity to any database with an ODBC driver
The back end level includes the actual data repository of the database. The database system used herein should be appropriate for storing relations, objects including figures, maps, and digital images of documents, and free text.

The intercompany part 303 also consists of three levels. The first level is the repository level 317. The second and third levels are an engine level 319 and an application level 321. In this part, the second and third levels may be distributed among several sub-parts. For example, there may be engine 319 and application 321 levels for use by regulatory organizations, coexistent with separate engine 319 and application 321 levels for use by the public.

The engine levels 307, 319 implement several software tools embodied in one or more software programs which are used to execute a method according to the invention. The various stages and steps next described may be performed for some purposes entirely by one engine level, or may be accomplished for other purposes by cooperation between multiple engine levels, as will be explained.

As shown in FIG. 4, a transformation of information is performed by the present invention. Documents are introduced into a disclosure document text space 401. The information contained in the text space 401 is synthesized 403 into a structured format in formatted database space 405. Finally, analysis 407 produces information in a useful textual form.

FIG. 5 expands the transformation illustrated in FIG. 4 into several distinct steps. Text is entered 501 and automatically formatted as a raw relational database 503. Extracts may be taken from the raw relational database in the form of free text abstracts 505. The extractions produce abstracts which are more useful than similar extractions made directly from the text entered at step 501 because the formatting of the database brings like kinds of information together in one place.
Diagonalization produces one or more grouped databases 507 which may be statistically analyzed to produce a related companies analysis 509. Finally, a conclusion process synthesizes from the grouped database 507 an inferenced database 511 which is analyzed to form a set of questions/discoveries 513.

The above-described transformation processes can occur at various points in the life cycle of a disclosure document. This life cycle is illustrated in FIG. 6. During Phase 1, data is entered 601, for example using a word processor, proofread 603 and reviewed 605. When the data entry is complete, the document is reviewed 607 and posted 609 to an intracompany repository during Phase 2. If the document is not yet required to be reported externally, then the temporary phases have not ended 611 and any previous step may be revisited, resulting in modifications to the document. However, when the temporary phases end 613, a permanent document is created during Phase 3. Final adjustments are made 615, the document is audited for completeness, correctness, etc. 617, and the document is finally frozen and reported 619. No further changes are permitted after reporting 619. When the document then enters Phase 4, it is archived 621 for some useful duration, after which it may be discarded 623. Although the document reported at step 619 may not be modified in a typical reporting system, there may be a Phase 5 of the document life cycle, during which the document may be annotated 625. Thus, explanations and corrections required, for example, by a regulatory agency may be entered.

The generation of a reporting entity's database of documents to be reported to a regulatory agency and integration of the reporting entity's database with the regulatory agency's database of documents reported is now described. Free text is entered by the user at various stages in the life cycle of a reporting document. Several software tools are used to process the input.
Generating the Raw Relational Database (RRD)

First, the Expanded Conceptual Dictionary (ECD) is discussed. This dictionary relates five kinds of information, conceptually described here as five columns: the left hand column is the Concept, the middle two columns are the Main Key and the Context, and the right hand two columns are the Instance and the Synonym. The Main Key and the Context are optional (i.e. do not necessarily exist for each Concept). Each Concept is unique. That is, there is a unique Concept in each row of the dictionary. The Concepts contained in the ECD are candidates to form columns defining a matrix database design. Each Concept has associated therewith one or more Instances. The Instances are the kernel or root of the word (e.g. "do" for "does", "did", "done"), and could include Synonyms.

A Concept is, for example, a "company", and the Instances could include "IBM," with a Synonym of "International Business Machines." The Context for a "company" could be another Concept, such as "transaction." So unless there is an Instance of a "transaction," such as "joint venture", then "IBM" is not identified as a "company." Thus, for example, "The blue giant", which can be defined as a Synonym of "IBM," is understood as a "company" only in the Context of a "transaction" (or its Instances) in the sentence. Otherwise, "The blue giant" might be interpreted, for example, as a type of whale. This allows for an Instance to belong to more than one Concept.

When free text input is received, the text is first parsed on the basis of the contents of the ECD. Each paragraph of the free text input is parsed into words. Roots are identified for each of the words. Parsing identifies the Concepts represented by the words, by finding the word roots or words in the ECD amongst the Instances. Rows in a database matrix are formed for each word found in the ECD. Each row comprises a record, with the columns of each record, i.e. the database table definition, being the Concepts.
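The Context-based disambiguation described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `EcdEntry` class, the `find_concepts` function, the dictionary contents, and the use of plain lowercase substring matching in place of root extraction are all assumptions made for illustration. An entry without a Context stands for the default reading of an Instance, and an entry with a Context overrides that reading only when an Instance of the Context Concept occurs in the same sentence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EcdEntry:
    """One hypothetical ECD row: an Instance, its Concept, and an optional Context."""
    concept: str
    instance: str                   # stored in lowercase for simple matching
    context: Optional[str] = None   # Concept that must co-occur; None = default reading


# Illustrative dictionary contents (assumed, not taken from the source).
ECD = [
    EcdEntry("business action", "joint venture"),
    EcdEntry("company", "ibm"),
    EcdEntry("company", "blue giant", context="business action"),
    EcdEntry("whale", "blue giant"),
]


def find_concepts(sentence: str, ecd=ECD) -> dict[str, str]:
    """Map each Instance found in the sentence to a single Concept."""
    text = sentence.lower()
    hits = [e for e in ecd if e.instance in text]   # substring match: a simplification
    # Concepts established by context-free entries found in the sentence.
    present = {e.concept for e in hits if e.context is None}
    resolved: dict[str, str] = {}
    for e in hits:                                   # default readings first
        if e.context is None:
            resolved.setdefault(e.instance, e.concept)
    for e in hits:                                   # context-bound readings override
        if e.context is not None and e.context in present:
            resolved[e.instance] = e.concept
    return resolved
```

With these assumed entries, "blue giant" resolves to "company" in a sentence that also mentions a joint venture, and to "whale" otherwise.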
The database is constructed paragraph-by-paragraph, on the fly. The output database is known as the Raw Relational Database (RRD). From the RRD, text can be extracted representing the bare essence of the paragraph. This extraction is referred to as a free text abstract, as described above in connection with FIG. 5. The abstract is used both for verification and for quality assurance, as described below. For example, in describing a relationship with a vendor, if the lead time is missing, it is easily identified. The abstract is generated from one or more predefined templates that define connection words between the Instances.

Note that the dictionary is prepared in advance, i.e. before parsing input text, and can be updated to account for more and more Concepts. Since the domain of Concepts in which the dictionary is used is defined, this process asymptotically converges on a dictionary representing a high percentage of the Concepts encountered in the domain.

Operation of the system upon a free text input in a predetermined domain, e.g. business, legal or financial, is now described. Each of the software tools next described has a user interface for editing elements of the system. The software tools are used by the system while parsing free text. The system includes the software tools, but the system "learns" from the input processed, and the content of databases or knowledge bases included in the software tools is enhanced as the system gains experience.

The ECD consists of the five conceptual columns noted above: the Concept, the Main Key, the Context, the Instances, and the Synonym. For each Concept (such as company), a Context is defined in which this Concept is used, and then the Instances and Synonyms are defined. The Context and Synonym are optional. For each Concept there could be many Instances. The Context is required in any case in which there is an Instance which belongs to more than one Concept.
That is, if an Instance may refer to more than one Concept, then only one Concept may include that Instance without having a Context also associated with the Instance, and all the other appearances of that Instance must have a Context associated with them. The Concept with which the Instance is associated absent a Context is the Default Concept. Each Context is itself a unique Concept that has Instances of its own.

The ECD has an option to include a Main Key level entry, which determines the design of the database. If it is not included, then the system has an internal, i.e. hard wired, default. This field defines the Main Key of the database table.

An event rules tool includes logic rules that define the relationship among the Instances of the Concepts. The hypothesis or trigger to the rule, i.e. the if part, is an Instance of an event that is identified in the first parsing step. For example, suppose an event rule hypothesis for the Concept "business action" exists. If the Concept "business action" is found, then the rule parser looks for the then part, which specifies that there should be Instances of the Concepts "companies" and "technologies." If these Concepts are found then the rule is true, and a new record is added to the database, as explained later.

An auxiliary knowledge software tool is built as a collection of rules that uses background knowledge to check validity and quality, as well as to automatically add information not found in the ECD to the database, based on built in rules. For example, if a legal action is described, then the auxiliary knowledge software tool checks whether the name of the law office is included. It may then add the address etc. for a particular name and also check that other desired parameters are present, such as liability amount, expected time for resolution, etc. The rules used in the auxiliary knowledge software tool depend upon whether it is used in the intracompany part or the intercompany part.
In an internal application, i.e. in the intracompany part, the rules reflect the format that the company wants to impose on its own reports. In the case of the repository system, i.e. the intercompany part, the rules reflect desired automatic analysis parameters.

The following is a detailed description to be read in connection with FIG. 7, explaining the generating of the First Tier Database (FD) and Second Tier Database (SD). This is Stage 1 of the overall method.

Generating the First Tier Database (FD)

The input to the system is one or more report documents formatted as free form text. The first step is dividing the text into paragraph and word units to be parsed 701, and then identifying in each sentence the instances (words) which have a concept associated with them 703. Ties, which occur when an instance has associated with it more than one concept, are broken according to the context associated with the document being input 705. For example, in the business context the "Blue Giant" is IBM, and "Apple" is the computer company, while in other contexts they may be a type of whale and a type of fruit, respectively.

For each paragraph of the input document, a list of concepts is generated that form tables of the database. The database entries are the instances, i.e. the relevant words of the input document, related to concepts in the ECD. For the first paragraph the concepts form the table definition, and the entries are the instances, i.e. the words that existed in the free text of the input document. In the second paragraph, there will likely be some new and some existing entries. The new concepts are added, that is, a new table is generated and filled with the entries of the instances. The tables are built on the fly, keyed to a main key Concept having a Main Key value in the ECD, e.g. a company, a law suit, a person, etc.
In case more than one Concept has a Main Key value, the keys are organized into levels according to Main Key value, and the higher level Main Key is used. In case several Main Keys appear at the same highest level, the first appearance of the repeated Main Key is used. The result of this step is a relational database 707. The design of this relational database is not yet efficient for the purposes of the present invention, but the free form text document is transformed by this step into a formatted one.

Generating the Second Tier Database (SD)

The input to this step is the FD and the event rules (ER), which are combined to generate additional tables which become part of the resultant database 709. The tables of the FD carry an understanding of the original text. However, this understanding cannot be readily derived from the FD. For example, in the FD there may be a table for each company reported on, containing the information that they had some new business development, for example developing flat screen technology. In the SD it will be revealed that this is a joint venture between two companies. Thus, the table key of the SD is different from that of the FD, and the rows all comply with certain structured rules. This additional structure imposed on the information provides the user with another level of understanding of the text. Where the understanding level in the FD was that the ASCII combination of the letters "IBM" means a company and all the related knowledge, now the idea of, for example, the concept "business action" represented by the instance "joint venture" is understood as associating two or more companies (using the first level understanding) with each other and with some technology area. This understanding is made use of in Stage 3 of the method.

Generating the Augmented FD/Augmented SD (AFD/ASD)

Now Auxiliary Knowledge (AK) is used to augment the FD, the SD or both 711.
For example, a pointer may be added to a database record containing information about a person, linking the person's history to the record, or a pointer may be added to a database record containing information on a company project, linking descriptions of related projects to the record.

Validation and Verification

The AK may also be used for validation and verification. For example, rules about what should be in a litigation description are used to validate that all the appropriate fields are present, and verify the current entries, such as address, phone numbers, etc. Such issues go beyond the conventional data type verification and domain verification inherent in a conventional database definition. The output of Stage 1 of the method is the RRD 713, which includes the FD, the SD, the AFD and the ASD.

An example of the Stage 1 processing is now illustrated using the input sentence, "ABC and XYZ are forming a joint venture to develop flat screen technology." For this example, the contents of the ECD are given by Table 2.
TABLE 2
Concept            Main Key Level   Context            Instances
Company            1                Business Action    XYZ, ABC, IBM
Business Action                                        joint venture
Technology                                             flat screen, memory
The event rules used to transform the FD into the SD are given in Table 3, as follows.
TABLE 3
if Business Action then:   The necessary parts are: Two or more companies/divisions
                           The optional parts are: Technology
The sentence is parsed in accordance with the contents of the ECD to form the FD, as now given in Table 4.
TABLE 4
Record   Company   Business Action   Technology    Associated With Record
1        XYZ       joint venture     flat screen   2
2        ABC       joint venture     flat screen   1
Next the contents of the FD of Table 4 are transformed into the SD by invoking any relevant event rules. The SD thus generated is given in Table 5.
TABLE 5
Record Number   Action Type   Company   With Company   Action               Technology
1               Business      ABC       XYZ            form joint venture   flat screen
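The step from the FD of Table 4 to the SD of Table 5 can be sketched in Python as follows. This is an illustrative reading of the event rule of Table 3, not the patented implementation: the field names, the `EVENT_RULES` layout, and the `apply_event_rules` function are assumptions, and the sketch assumes that all FD rows passed in come from the same sentence.

```python
from collections import defaultdict

# FD rows corresponding to Table 4 (field names are illustrative assumptions).
FD = [
    {"record": 1, "company": "XYZ", "business action": "joint venture",
     "technology": "flat screen"},
    {"record": 2, "company": "ABC", "business action": "joint venture",
     "technology": "flat screen"},
]

# Event rule of Table 3: a business action needs two or more companies,
# and may optionally name a technology.
EVENT_RULES = {
    "business action": {"required": ("company", 2), "optional": ["technology"]},
}


def apply_event_rules(fd_rows, rules):
    """Build SD records from FD rows whenever a rule's 'then' part is satisfied."""
    sd = []
    for trigger, rule in rules.items():
        # Collect the FD rows that share the same Instance of the trigger Concept.
        by_instance = defaultdict(list)
        for row in fd_rows:
            if row.get(trigger):
                by_instance[row[trigger]].append(row)
        required_concept, minimum = rule["required"]
        for instance, rows in by_instance.items():
            values = sorted({r[required_concept] for r in rows
                             if r.get(required_concept)})
            if len(values) < minimum:
                continue  # necessary parts missing: the rule is not true
            record = {"action_type": trigger, "action": instance,
                      required_concept: values}
            for concept in rule["optional"]:
                found = sorted({r[concept] for r in rows if r.get(concept)})
                if found:
                    record[concept] = found
            sd.append(record)
    return sd
```

Calling `apply_event_rules(FD, EVENT_RULES)` yields a single record that associates ABC and XYZ with the joint venture and the flat screen technology, mirroring the one row of Table 5. With only one company present, no record is produced.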
Background knowledge, such as references to regulations applicable to reported activities, is applied here to validate and to augment the FD. A pointer to such regulations may be inserted in the relevant record of the FD and the SD to generate the AFD and the ASD.

Diagonalization

The result of Stage 1, the RRD, is the input to Stage 2. Because of the way it is constructed, the RRD could be a very sparse matrix, with many different topics. The Stage 2 processing clusters similar topics. This is most useful for the repository database, but may also find utility at the company level, especially if the reporting entity is part of a large organization with many affiliated companies. The processing of Stage 2 shuffles the rows and columns so that a dense diagonal matrix is created, the Grouped Database (GD). Stage 2 processing may be applied to all the different component databases of the RRD, i.e. SD, FD, ASD, and AFD. The GD is organized in such a manner as to enable the Stage 3 processing. The GD provides an additional benefit in that it is a more efficient way of saving the information contained in the databases that were created in Stage 1 processing.

Stage 2 processing, illustrated in FIG. 8, proceeds as follows. First, for each two rows of the input database, a similarity index between them is calculated 801. The similarity index used is the number of columns for which both rows have an entry, i.e. in both rows the entry is not empty, divided by the total number of columns where at least one of the two rows has an entry. This number is between zero, indicating no similarity at all, and one, indicating total similarity. Other similarity indices could be used, as will be understood by those skilled in this art. Next, the rows are placed into order of their similarity indices 803, from highest down to a predetermined minimum threshold 805. Below the threshold, the rows may be in any convenient order.
The rows having similarity indices which exceed the threshold constitute a Group 807, also referred to as a block. The process is repeated 809 for the rows which are not yet part of a Group, until there are no more rows possible to arrange.

In somewhat more detail, the method is performed as follows. Assume $I(i,j)$ is the input matrix and $O(i,j)$ is the output matrix, and further that $i_1, i_2, \ldots, i_n$ are the rows in the matrix (the records), and that $j_1, j_2, \ldots, j_p$ are the columns in the matrix (the concepts, i.e. the database table design). Thus $I(i_4, j_5)$, for example, is the entry corresponding to the fourth record and the fifth column of the input data matrix. $S(i_l, i_k)$ is the similarity index between the $l$th and the $k$th record. The number of instances that exist in the $l$th record is $c(i_l)$, while $c(j_k)$ would be the number of instances for the $k$th concept that exist over all records. Thus, the similarity index between two records is defined to be $S(i_l, i_k) = c(i_l \cap i_k) / c(i_l \cup i_k)$. This index takes on values between 1, representing two records that deal with the same concepts, and 0, representing two records which consider totally different concepts. Values falling between 0 and 1 show the relative similarity of two records, in the sense that the higher the index, the higher the similarity. The similarity index is not, however, necessarily linear.

Using the above notation, the method is applied to a database as follows. For each two rows, compute the similarity index between them 801. For each row, sum all the similarity indices, as follows: $S(i_l) = \sum_{k=1,\, k \neq l}^{n} S(i_l, i_k)$. The row $l$ with the maximum value for $S(i_l)$, $I^*$, is defined by $I^* = \arg\max_{l = 1, 2, \ldots, n} S(i_l)$ in step 802. By this definition, $I^*$ is the record $l$ of maximum similarity to all other records. It has a similarity index value of $S_{i^*} = \max_{l = 1, 2, \ldots, n} S(i_l)$.
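As a small numerical illustration of the similarity index and of the choice of the record of maximum summed similarity, consider the following Python sketch. Each record is represented simply as the set of concepts (columns) for which it has an entry; the function names and the four sample records are hypothetical and do not come from the source.

```python
def similarity(row_a: set, row_b: set) -> float:
    """Shared non-empty columns divided by columns non-empty in at least one row."""
    union = row_a | row_b
    return len(row_a & row_b) / len(union) if union else 0.0


def pivot_record(rows: dict) -> tuple:
    """Return (record id, summed similarity) for the record most similar to all others."""
    totals = {
        l: sum(similarity(rows[l], rows[k]) for k in rows if k != l)
        for l in rows
    }
    best = max(totals, key=totals.get)   # ties fall to the first record encountered
    return best, totals[best]


# Hypothetical records: record id -> set of concepts that have an entry.
sample = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"b", "c", "d"},
    4: {"g", "h"},
}
```

For this sample, records 1 and 2 share two of three concepts (index 2/3), records 1 and 3 share two of four (index 1/2), and record 4 shares nothing with the rest, so `pivot_record(sample)` selects record 1 with a summed similarity of 2/3 + 1/2.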
Next, all the record numbers are arranged in an ordered descending list $L_{i^*}$ according to the similarity of each record to $I^*$, with ties broken arbitrarily, in step 804. Thus, the list $L_{i^*} = (I^*_{k(1)}, \ldots, I^*_{k(n-1)})$, where $k$ identifies the number of the original record, $k \in \{1, 2, \ldots, n\}$, $k \neq I^*$, and the number in parentheses is the place in the list of the associated record, also called the order number. For example, $S_{i^* l(5)} \geq S_{i^* k(6)}$ means that the similarity of the $l$th record to $I^*$ is greater than or equal to the similarity of the $k$th record to $I^*$.

A threshold $T$, for example between 0.6 and 1.0, is now selected 805. The preferred value for $T$ is 0.75. The value reflects the relative desire for similar data with respect to the size of the database. The list $L_{i^*}$ is now trimmed at a cut-off point to form cut-off list $L_{i^*c}$ such that similarity $S_{i^* q(n-1)}$ of the last record $q$ in the list is greater than or equal to $T$. Thus, $L_{i^*c} = (I^*_{k(1)}, \ldots, I^*_{k(m)})$, where the first $m$ elements of $L_{i^*}$ appear ($m \leq n-1$), and $S_{i^* k(m)}$ is the similarity value of the last record in the $L_{i^*}$ list whose similarity index with $I^*$ is greater than or equal to $T$. The rows of the database are then reordered such that $I^*$ occurs first, followed in order by the rows in the $L_{i^*c}$ list 807. The records not in the list follow in an arbitrary order.

A similar process is now undertaken for the columns, i.e. concepts. The result is a dense matrix for the first $m$ rows. The process is similar to that applied to the rows, i.e. records, above. Consider as input a matrix with $m$ rows (records) and $q$ columns, where $q$ is the number of columns for which at least one of the chosen $m$ records has an instance. The order of the concepts, i.e. the columns in the $m \times q$ matrix, is rearranged using the method described above with respect to the rows. However, the similarity is considered between the column vectors of the matrix, rather than the rows.
At this point, the result is a new ordered $m \times q$ matrix, which includes $m$ records and $q$ concepts. This matrix is termed a Group. The groups are numbered, the first one being called Group 1. Now the input matrix of $n$ records and $p$ concepts is reduced to eliminate Group 1 from further processing, so that the remainder of the matrix may be processed separately 808. Delete the $m$ records and form a new input matrix with $v = n - m$ records and $r$ concepts, where $r$ is derived from the original $p$ concepts by deleting the concepts that do not have any instances in the remaining records. From the $v \times r$ matrix derive Group 2 in the same manner as above. Then recursively continue 810 until all records have been considered, and $w$ blocks have been formed. There are now $w$ matrices defining groups derived from the initial matrix, but each group is defined by a dense matrix dealing with similar concepts. The GD, which is the final result of this process, ideally has a diagonal structure, as seen in the example below.

Variations on the method of Stage 2 are possible, including for example conditional diagonalization. This applies when the groups should reflect the fact that a certain concept should exist. For example, if one wants to make an analysis of all the joint ventures among companies that file to the repository, then the similarity may be analyzed after filtering for that instance.

Tables 6 and 7 represent an example of the input and output of Stage 2 processing, respectively. Each row is numbered and concepts are represented by the letters a-h. The presence of an entry in a row for a concept is denoted by an asterisk (*). The diagonalized result is shown for a predetermined threshold of T=0.6. The resulting groups are {3, 5, 7, 8}, {4, 9, 6}, {1, 10} and {2}.
TABLE 6
a b c d e f g h
1 * * *
2 * *
3 * * *
4 * * *
5 * * *
6 * * * * *
7 * * *
8 * * *
9 * * *
10 * * *
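The row-grouping part of Stage 2 can be sketched in Python as shown below. This is a simplified reading under stated assumptions rather than the patented implementation: records are modeled as sets of non-empty concepts, ties are broken by input order, the column reordering within each group is omitted, and the function names and sample data are hypothetical (the sample is not the data of Table 6).

```python
def jaccard(a: set, b: set) -> float:
    """Similarity index: shared concepts over concepts present in either record."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_rows(rows: dict, threshold: float = 0.75) -> list:
    """Repeatedly pick a pivot record and peel off the records similar to it."""
    remaining = dict(rows)
    groups = []
    while remaining:
        ids = list(remaining)
        # Pivot: the record with the largest summed similarity to the others.
        pivot = max(
            ids,
            key=lambda l: sum(jaccard(remaining[l], remaining[k])
                              for k in ids if k != l),
        )
        # Order the other records by descending similarity to the pivot.
        ranked = sorted(
            (k for k in ids if k != pivot),
            key=lambda k: jaccard(remaining[pivot], remaining[k]),
            reverse=True,
        )
        # Keep only the records at or above the threshold T.
        members = [pivot] + [
            k for k in ranked
            if jaccard(remaining[pivot], remaining[k]) >= threshold
        ]
        groups.append(members)
        for k in members:          # remove the group before the next pass
            del remaining[k]
    return groups


# Hypothetical input: record id -> concepts with an entry.
example = {
    1: {"a", "b", "c"},
    2: {"a", "b", "c"},
    3: {"a", "b"},
    4: {"f", "g"},
    5: {"f", "g", "h"},
    6: {"d"},
}
```

With a threshold of 0.6, `group_rows(example, 0.6)` places records 1, 2 and 3 in the first group, records 4 and 5 in the second, and leaves record 6 as a group of its own. Raising the threshold toward 1.0 produces smaller, denser groups.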
Conclusion Black Box

Finally, in Stage 3 of the method, conclusions are reached and issues raised by the contents of the input document are identified. Identification of issues raised by the contents of a database is a particularly difficult problem in any large database, especially when starting with a dynamic, data dependent free text input. However, the present invention makes such identifications as described below, after performing the two preparatory stages presented above and in view of a knowledge of the domain. The input to this stage is the w groups, i.e. blocks, from the GD generated in the previous stage, and the output is the Inferenced Database.

From a GD group, the software identifies rows having in common a missing column. This technique is especially effective when comparing such rows to a catalog. It is easy to observe changes in a pattern established over time by a specific company or a group of companies. For example, suppose the column for the Concept "inventory turnover" is missing. Although it is not necessary data, suppose further that it had been consistently reported in previous years. The question now arises, "Why is this data not reported this year?" This type of result is gathered with other, similar results in the Questions Discoveries (QD) database. Another example is that if many companies disclose the potential monetary consequence of litigation, but some do not, then this entry in the matrix will be empty for them, and thus will be easily flagged and identified as missing. In other cases, a column may be added over time. This may lead to a new automatic conclusion, for example, about special new relations between companies. The GD of the SD also generates the Company Analysis Report (CAR), based on the concepts that are filled.

A tool useful in this stage and in the next stage is the Catalog. The Catalog defines, for a disclosure area, which columns, i.e. fields, are required and which fields are optional. Thus, the disclosure can be checked for completeness, and the quality of the document can be evaluated. In an example of the use of the Catalog, the input may be the result of Stage 2 diagonalization, i.e. the GDs. The output is an Inference Database (ID) plus QD (Question Discovery) and CAR (Company Analysis Report). The Catalog defines a Database Design Template for a variety of issues, such as the required disclosure of legal issues and also desired optional issues. For example, the potential value "damage" for litigation matters may be defined as "not required."

Verification and validation of a current report at either the intracompany level or the intercompany level may proceed as follows. If a concept is missing, or is sparse, in the current report and the Catalog indicates it to be either mandatory or optional, then it is included in the QD. If an instance that has appeared in previous years' reports is mostly missing from current reports, then that instance should be added to the QD. If the current report is compared to reports of a group of related companies, concepts common to the related companies but missing from the current report may be added to the QD. Finally, instances of concepts may be compared to benchmarks, leading indicators or industry averages. If the current report includes instances of a concept which are substantially different from the benchmark, leading indicator or average used, then that concept may be added to the QD. The process which performs verification and validation may have embedded therein logic to formulate specific questions in connection with the items added to the QD.

The present invention has now been described in connection with a number of specific embodiments thereof. However, numerous modifications which are contemplated as falling within the scope of the present invention should now be apparent to those skilled in the art. Therefore, it is intended that the scope of the present invention be limited only by the scope of the claims appended hereto.
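The Catalog-based verification described in the Stage 3 discussion above can be illustrated with a short Python sketch. It covers only two of the checks mentioned (concepts listed in the Catalog but missing from the current report, and concepts reported in a previous year but absent now). The Catalog layout, the `questions_for` function, and all sample values are assumptions made for illustration and are not the patented implementation.

```python
# Hypothetical Catalog: disclosure area -> required and optional concepts.
CATALOG = {
    "litigation": {
        "required": ["law office", "liability amount"],
        "optional": ["expected resolution time"],
    },
}


def questions_for(area, current, previous=None, catalog=CATALOG):
    """Return (concept, reason) pairs to be added to the QD for one report."""
    spec = catalog[area]
    questions = []
    flagged = set()
    # Concepts that the Catalog lists for this area but the report leaves empty.
    for kind in ("required", "optional"):
        for concept in spec[kind]:
            if not current.get(concept):
                questions.append((concept, f"{kind} in catalog, missing"))
                flagged.add(concept)
    # Concepts reported in a previous year that are absent this year.
    for concept, value in (previous or {}).items():
        if value and not current.get(concept) and concept not in flagged:
            questions.append((concept, "reported previously, missing now"))
    return questions


# Hypothetical reports: concept -> reported instance.
this_year = {"law office": "Smith & Co"}
last_year = {"law office": "Smith & Co", "inventory turnover": "4.2"}
```

Here `questions_for("litigation", this_year, last_year)` flags the missing liability amount, the missing expected resolution time, and the inventory turnover that was reported last year but not this year. Comparison against related companies or industry benchmarks would need further inputs and is left out of the sketch.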
