Integrated system and method of data warehousing and delivery5781911Abstract A system and method is capable of providing a much more effective enterprise-wide decision support system. It consists of an integrated end-to-end solution including automatic generation of data warehouses or data marts integrated with automatic delivery of information from the data warehouses or data marts to knowledge workers throughout the enterprise using a "publish and subscribe" paradigm for dissemination of the information at any time, to any places, in any formats to any number of knowledge workers. This integration allows information in the data warehouses or data marts to be delivered immediately after every refresh of the data warehouses or data marts thereby allowing maximum utilization of the valuable information in the data warehouses or data marts throughout the enterprise to gain the most optimum decision support. Claims What is claimed is: Description BACKGROUND
______________________________________
Table Name Column Name Data Type
______________________________________
EMPLOYEE Employee.sub.-- Number
EBCDIC
Name EBCDIC
Address EBCDIC
Sex EBCDIC
SKILL Employee.sub.-- Number
EBCDIC
Skill EBCDIC
______________________________________
Let us say that these definitions are to be used as the corresponding table definitions in the PERSONNEL Data Warehouse or Data Mart except that all the data types are to be transformed from EBCDIC to ASCII. This transformation is specified by defining a tasklist to transform from EBCDIC to ASCII for each column of the tables. The definition of the third table of PERSONNEL is derived from the Payroll database, say, PAY table whose definition is as follows:
______________________________________
Table Name Column Name Data Type
______________________________________
PAY Employee.sub.-- Number
ASCII
Pay.sub.-- Amount
ASCII
Pay.sub.-- Month
ASCII
______________________________________
Let us say that this definition is accepted as the definition of the corresponding third table of PERSONNEL without any change. Next, the EMPLOYEE table is defined as having its Employee.sub.-- Number column as a primary key while the SKILL and PAY tables are defined as having their Employee.sub.-- Number column as a foreign key referencing the EMPLOYEE table. These definitions are then stored in the Information Catalog. These three tables then form the tables of the Data Warehouse or Data Mart. The definitions obtained so far are then stored in the Information Catalog. After this, a Conceptual Business View of the three tables is then defined which can be of the following form:
______________________________________
EMPLOYEE class Joins or Navigation Path
______________________________________
<About EMPLOYEE>
Nil
Employee.sub.-- Number
Nil
Name Nil
Address Nil
Sex Nil
<EMPLOYEE has SKILL>
EMPLOYEE.Employee.sub.-- Number =
SKILL.Employee.sub.-- Number
Skill EMPLOYEE.Employee.sub.-- Number =
SKILL.Employee.sub.-- Number
<EMPLOYEE has PAY>
EMPLOYEE.Employee.sub.-- Number =
PAY.Employee.sub.-- Number
Pay.sub.-- Amount
EMPLOYEE.Employee.sub.-- Number =
PAY.Employee.sub.-- Number
Pay.sub.-- Month
EMPLOYEE.Employee.sub.-- Number =
PAY.Employee.sub.-- Number
______________________________________
*This business view together with the joins or navigation path
This business view together with the joins or navigation path between the three tables to obtain this view are then stored in the Information Catalog. Next, by way of example, two Extraction Orders, say E1 and E2 are defined. E1 is an order to extract data from the PERSONNEL database and has the following dependency list and tasklist with the workflow of the tasklists of these two Extraction Order shown in FIG. 4: Dependency List: Nil Tasklist: E1.1 Extract the data of the EMPLOYEE table from the Personnel database. E1.2 Perform the EBCDIC transformation on each column of the extracted data of the EMPLOYEE table. E1.3 Transport the extracted and transformed data of EMPLOYEE table to a Staging Area called, say EMP.sub.-- DATA. E1.4 Extract the data of the SKILL table from the same Personnel database. E1.5 Perform the EBCDIC transformation on each column of the extracted data of the SKILL table. E1.6 Transport the extracted and transformed data of SKILL table to the EMP.sub.-- DATA Staging Area. E2 is an order to extract data from the Payroll database and has the following dependency list and tasklist, as depicted in FIG. 4b: Dependency List: Tasklist: E2.1 Extract the data of the PAY table from the Payroll database. E2.2 Transport the extracted data of PAY table to the EMP.sub.-- DATA Staging Area. As both these Extraction Orders are for different databases they can be executed concurrently and independently of each other. The Extraction Orders also contain the time for starting the execution of the tasklist, say midnight every night. Publisher Module This module is used by someone who is typically a business analyst within an organization. Using the publisher module he first retrieves a Conceptual Business View from the Information Catalog. He may enhance this view by defining new columns or new tables to add to the Data Warehouse or Data Mart. Each of the new column or table defined is a transformation operation or a set of transformation operations which together form a tasklist. This enhanced Conceptual Business View is then stored in the Information Catalog. Next, the publisher uses the enhanced Conceptual Business View to determine what should be published. A way he does this is to eliminate certain items from the enhanced Conceptual Business View. Another way could be to specify filters that limit only certain records of data of the enhanced Conceptual Business View to be retrieved by subscribers. The result of this item elimination or filtering is a Published Business View which is then stored in the Information Catalog. After this the publisher defines a Publishing Order which is an order to retrieve the extracted and transformed data in the Staging Area, perform any transformations that he has specified in defining the enhanced Conceptual Business View, and then load the resultant data into the Data Warehouse or Data Mart. As in the case of the Extraction Order created by the person using the Supplier Module, this Publishing Order is a tasklist which may include a set of timings as to when the execution of the tasklist is to start. Let us continue with the earlier example to illustrate this process. The publisher first retrieves the EMPLOYEE class which is the Conceptual Business View stored in the Information Catalog. Let us say that from this view he defines a new column called Year.sub.-- to.sub.-- date.sub.-- Pay to be added to the PAY table. This column is a derived item obtained by summing the Pay.sub.-- Amount from the beginning of the year to the current month. The derivation of this item is considered another form of transformation. This new derived column when added to the original Conceptual Business View becomes an enhanced Conceptual Business View as shown below: EMPLOYEE class <About EMPLOYEE> Employee.sub.-- Number Name Address Sex <EMPLOYEE has SKILL> Skill <EMPLOYEE has PAY> Pay.sub.-- Amount Pay.sub.-- Month Year.sub.-- to.sub.-- date.sub.-- Pay This enhanced Conceptual Business View together with the new transformation defined for the Year.sub.--to.sub.-- date.sub.-- Pay item is then stored in the Information Catalog. Next, the publisher defines views to be published. Suppose he wants to publish a view for a specific group of subscribers who are clerks. Such a view to be published could be as follows: EMPLOYEE class for clerks to access <About EMPLOYEE> Employee.sub.-- Number Name Address Sex <EMPLOYEE has PAY> Pay.sub.-- Amount . . . less than 1000 Pay.sub.-- Month Year.sub.-- to.sub.-- date.sub.-- Pay Here we have defined a view which has some items eliminated from the enhanced EMPLOYEE class view, <EMPLOYEE has SKILL> and Skill. In addition, a filter has been added to this smaller view, namely the Pay.sub.-- Amount to be less than 1000 i.e. the clerks are only allowed to see information of those employees earning a Pay.sub.-- Amount less than $1000. This then becomes a Published Business View which is then stored in the Information Catalog. After this the publisher specifies a Publishing Order, say P1, which contains the following illustrative dependency list and tasklist with the workflow of the tasklist of this Publishing Order shown in FIG. 5: Dependency List: E1, E2 Tasklist: P1.1 Retrieve the EMPLOYEE data in the EMP.sub.-- DATA Staging Area and transport it to a database server, say DS1 that is be used as the PERSONNEL Data Warehouse or Data Mart. P1.2 Retrieve the SKILL data in the EMP.sub.-- DATA Staging Area and transport it to the same database server, namely DS1. P1.3 Retrieve the PAY data in the EMP.sub.-- DATA Staging Area and transport it to the same database server, namely DS1. P1.4 Load the EMPLOYEE data into the EMPLOYEE table of PERSONNEL in the DS1 database server. P1.5 Load the SKILL data into the SKILL table of PERSONNEL in the DS1 database server. P1.6 Perform the transformation defined by the publisher, namely derive a new column called Year.sub.-- to.sub.-- date.sub.-- Pay on the PAY data in the DS1 database server. P1.7 Load the PAY data and the derived Year.sub.-- to.sub.-- date.sub.-- Pay data into the PAY table of PERSONNEL in the DS1 database server. This tasklist has two dependencies, namely it cannot start until the Extraction Orders E1 and E2 have been successfully executed. The Publishing Order P1 also contains the time for starting the execution of the tasklist, say, every Monday immediately after Extraction Orders E1 and E2 have been executed successfully. Subscriber Module This module is used by a subscriber who is typically an end-user who wants information or knowledge from the Data Warehouse or Data Mart either for himself or for his boss. This module allows the subscriber to view all Published Business Views that he is authorized to see by the publisher. From a Published Business View he then defines his Subscription Order. A Subscription Order comprises a set of items selected from the Published Business View, any transformation that he wants, the time and frequency of the information to be retrieved and delivered to him, the format that he wants the information to be in when he receives it and the addresses of where the information is to be sent. This Subscription Order is then translated into a tasklist together with the timings specified and stored in the Information Catalog. Let us illustrate this with the same earlier example. Suppose subscriber A accesses the Published Business View which is called the EMPLOYEE class for clerks to access. Let us say he selects the following items from this view: Employee.sub.-- Number Name Pay.sub.-- Amount Pay.sub.-- Month Year.sub.-- to.sub.-- date.sub.-- Pay Let us say he also defines a new item such as Bonus which is a one-tenth of every month's Pay.sub.-- Amount. Deriving this new item is also considered a transformation. This new item is to be added to the above items he has selected. He next specifies that he wants this information to be retrieved every Monday morning and delivered to his and his manager's PC in Excel spreadsheet format not later than 8.00 am. All these specifications constitute a Subscription Order which would be translated by this Subscriber Module into a tasklist with the workflow of the tasklist as follows shown in FIG. 6: Dependency List: P1 Tasklist: S1.1 Generate data access statement that conforms to the requested items from the PERSONNEL Data Warehouse or Data Mart. If the DS1 database server used for PERSONNEL is an SQL database the data access statement is an SQL statement. S1.2 Retrieve the information requested from the PERSONNEL Data Warehouse or Data Mart using the generated SQL. S1.3 Perform transformation to derive the Bonus item using the information retrieved from PERSONNEL. S1.4 Transform the information retrieved including the Bonus item into Excel spreadsheet format. S1.5 Transport the Excel spreadsheet information to the PC of the subscriber and his manager, This Subscription Order also contains the time of delivery, namely every Monday not later than 8 am. This order is then stored in the Information Catalog. System Administrator Module This module is used by a system administrator of the Data Warehouse or Data Mart system. Such a person would typically be a database administrator. He uses this module to define System Policies such as the following: Defining the users who can use the Data Warehouse or Data Mart system i.e. the logon ids, passwords and privileges of suppliers, the publishers and the subscribers as well as other system administrators. Defining the priority of the various users of this system. Defining the constraints imposed on scheduling carried out by the Master Scheduler Module. A constraint could be that the timings for extraction of data from the Data Source cannot be changed. Another constraint could be if there is a conflict between a Publishing Order and a Subscription Order the former takes precedence. Another constraint could also be that if processing of a Publishing or a Subscription Order is still in progress and that the time for extraction of data from the Data Source is due, the processing of the outstanding Publishing Order or Subscription Order could be either suspended or aborted. Master Scheduler Module This module takes as its input the three different types of orders stored in the Information Catalog, namely the Extraction Orders, the Publishing Orders and the Subscription Orders as well as the Scheduling Policies defined by the system administrator using the System Administrator Module. It then schedules these three different types of orders according to the Scheduling Policies defined by the system administrator. One scheduling policy could be scheduling by event using the dependency lists in the Extraction Orders, Publishing Orders or the Subscription Orders. Another policy could be scheduling by time specified in the Extraction Orders, Publishing Orders or the Subscription Orders. And yet another policy could be scheduling by some combination of these policies as well as with other policies such as priority of users. The output of this scheduling of the Extraction Orders, Publishing Orders and Subscription Orders constitute the Master Schedule. After a Master Schedule has been generated, the Master Scheduler Module then issues the Extraction Orders, the Publishing Orders and the Subscription Orders according to the Master Schedule to the Concurrent Manager Module for these orders to be executed. While issuing orders from its Master Schedule the Master Scheduler Module would also be informed by the Information Catalog of new Extraction, Publishing or Subscription Orders that have been created. Upon receiving these new orders it would then reschedule all outstanding orders including those in the Master Schedule to create an updated Master Schedule which it then used to issue orders to the Concurrent Manager Module. Concurrent Manager Module This module takes as its input the orders issued by the Master Scheduler Module. More than one order may be issued by the Master Scheduler Module to this module to be executed. It is therefore designed to manage the execution of many orders concurrently. It first reads the tasklist within each order and then controls and sequences the operations of the various tasks in the tasklist. For each task in the tasklist it first resolves the logical names referred in the task into physical addresses using a directory such as X.500 or Microsoft Exchange that is interfaced to this module. These names could be names of Data Sources, Staging Area, Data Warehouse or Data Mart, or names of users such as publishers, subscribers, or names of locations such as subscribers' destinations. Next, the module determines whether the task is a transformation or a transport task. It then sets up the connection to the physical addresses that has been resolved from the logical names in the task and calls on the appropriate service to execute the task, i.e. either the Transformation Service Module or the Transport Service Module. After an order has been successfully executed, the Concurrent Manager Module then informs the Master Scheduler Module to receive to next order for execution. The Concurrent Manager Module is also designed to handle restart and recovery should the connection to the Transport Service or the Transformation Service fail or should the execution of the task fail to start or fail to complete. Transport Service Module This module is used to transport data from one location to another. The locations, which are physical addresses of logical names in a task resolved by the Concurrent Manager Module as earlier described, could be either the Data Sources, the Staging Area, the Data Warehouses or Data Marts or the Subscribers' Destinations. This module consists of two parts with one part being on the source location and the other on the destination location. The two parts work together to coordinate the data transport. The Transport Service Module is designed to handle high volume data transfer (of the order of gigabytes) from the Data Sources to Staging Area or from Staging Area to Data Warehouses or Data Marts in a reliable and continuable manner. It does this by first breaking up a large file to be transported into small blocks and then transport one block at a time. For example, it could break up a gigabyte of data in a file to be transported into 1000 blocks of one megabyte each. Each megabyte of data would then be transport reliably across. Should a failure occur during the transport it would be able to continue the transfer of data from the affected block onwards instead of from the beginning of the file. It has capability to compress the data before it is transported for high throughput as well as encrypt the data to provide a secured transfer of data. It is also designed to multicast information from Data Warehouses or Data Marts to many Subscribers' Destinations at one time. Transformation Service Module This module is used to perform the various transformations as defined in the tasks within the tasklist of the Extraction Orders, the Publishing Orders and Subscription Orders. It is called upon by the Concurrent Manager Module to perform the transformations. It consists of many small functions each designed to handle one particular type of transformation. It is also able to handle restart and recovery should the execution of the transformation fail. All publications and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference. The invention now being fully described, it will be apparent to one of ordinary skill in the art that many changes and modifications can be made thereto without departing from the spirit or scope of the appended claims.
|
Same subclass Same class Consider this |
||||||||||
