Code generation

Method for generating code for processing a database

6925630

Abstract

A method is described for generating code for loading and performing other processing operations on a multi-dimensional data warehouse. Databases are defined as entity-relationship data models, using a modelling tool. A source file is created, containing instructions for processing the database, and including one or more high-level directives. At system build time, the source file is pre-processed, by replacing the directives with code, using information pulled from the data model, to generate a destination file containing the code for processing the database. At run time, the generated code is processed to replace run-time macros, indicating physical storage schemes. The generated code is then run, to process the database.


Claims

1. A computer-implemented method for generating code for loading a multi-dimensional data warehouse from a plurality of source databases, the method comprising the steps:

(a) defining the multi-dimensional data warehouse and the source databases as a set of entity-relationship data models;

(b) creating a source file containing instructions for loading the multi-dimensional data warehouse from the plurality of source databases, the instructions including a plurality of high-level directives, said high-level directives including an aggregation directive specifying that data from a specified source table within said source database is to be aggregated into a specified destination table within said data warehouse;

(c) automatically pre-processing said aggregation directive by

(i) accessing said data models to pull information from said entity-relationship data models about the structures of said source table and said destination table, and

(ii) using said information to generate code for aggregating data from the source table into the destination table; and

(d) appending said code to an executable destination file.

2. A computer-implemented method for loading a multi-dimensional data warehouse from a plurality of source databases, the method comprising the steps:

(a) defining the multi-dimensional data warehouse and the source databases as a set of entity-relationship data models;

(b) creating a source file containing instructions for loading the multi-dimensional data warehouse from the plurality of source databases, the instructions including a plurality of high-level directives, said high-level directives including an aggregation directive specifying that data from a specified source table within said source databases is to be aggregated into a specified destination table within said data warehouse;

(c) automatically pre-processing said aggregation directive by

(i) accessing said data models to pull information from said entity-relationship data models about the structures of said source table and said destination table, and

(ii) using said information to generate code for aggregating data from the source table into the destination table;

(d) appending said code to an executable destination file; and

(e) running the code in the executable destination file, to load the multi-dimensional data warehouse from the plurality of source databases.

3. A method according to claim 2 including the step of providing a storage scheme file containing a set of current physical data storage schemes, and wherein said step of creating the source file includes inserting at least one run-time processor macro into the source file, said run-time processor macro identifying at least one of said storage schemes, and wherein said step of running the code includes searching said storage scheme file to find one of said storage schemes identified by said run-time processor macro and replacing said at least one run-time processor macro with a storage clause generated at run time from said one of said storage schemes.

4. A computer system comprising:

(a) a set of entity-relationship data models defining a multi-dimensional data warehouse and a plurality of source databases;

(b) a source file containing instructions for loading the multi-dimensional data warehouse from the plurality of source databases, the instructions including a plurality of high-level directives, said high-level directives including an aggregation directive specifying that data from a specified source table within said source databases is to be aggregated into a specified destination table within said data warehouse;

(c) pre-processing means for automatically pre-processing said aggregation directive by

(i) accessing said data models to pull information from said entity-relationship data models about the structures of said source table and said destination table, and

(ii) using said information to generate code for aggregating data from the source table into the destination table;

(d) means for appending said code generated by pre-processing said plurality of high-level directives to an executable destination file; and

(e) processing means for running the code in the executable destination file, to load the multi-dimensional data warehouse from the plurality of source databases.

5. A computer readable medium, holding a computer executable program for performing a method for generating code for loading a multi-dimensional data warehouse from a plurality of source databases, defined as a set of entity-relationship data models, the method comprising the steps:

(a) creating a source file containing instructions for loading the multi-dimensional data warehouse from the plurality of source databases, the instructions including a plurality of high-level directives, said high-level directives including an aggregation directive specifying that data from a specified source table within said source databases is to be aggregated into a specified destination table within said data warehouse;

(b) automatically pre-processing said aggregation directive by

(i) accessing said data models to pull information from said entity-relationship data models about the structures of said source table and said destination table, and

(ii) using said information to generate code for aggregating data from the source table into the destination table; and

(c) appending said code to an executable destination file.


Description

BACKGROUND TO THE INVENTION

This invention relates to a method for generating code for processing a database, such as for example for loading a multi-dimensional data warehouse from one or more source databases.

There are already products that generate code for loading a database. In these known products, the code is generated directly from the entity-relationship model for the database. However, this approach suffers from several drawbacks. Firstly, to achieve high performance, specific tuning of the generated code is required. This makes it difficult to repeatedly build maintenance and enhancement versions of the product: typically, the directly-generated code is usable only as the starting point for a development. Secondly, the load code in practice is only part of a data warehouse solution-data extracts, automation, outside-database processing and reporting are all typical components, none of which are schema-dependent.

One object of the present invention is to provide a novel technique for generating code for processing a database, which does not suffer from these problems.

SUMMARY OF THE INVENTION

According to the invention a method for generating code for processing a database comprises the steps:

  • (a) defining the database in an entity-relationship data model;
  • (b) creating a source file containing instructions for processing the database, the instructions including one or more high-level directives; and
  • (c) pre-processing the source file, by replacing the directives with code, using information pulled from the data model, to generate a destination file containing the code for processing the database.


  • It can be seen that the present invention adopts a "pull" approach to the problem, instead of the conventional "push" approach. This significantly reduces the problem of tuning the code, since it gives the software developer control over the way the code is generated. Also, the "pull" approach does not restrict the developer to using a particular data modelling tool to create the data model.

    BRIEF DESCRIPTION OF THE DRAWINGS

    FIG. 1 is a schematic diagram showing a method for generating code for processing a database.

    FIG. 2 is a flowchart of a preprocessor used in the method.

    DESCRIPTION OF AN EMBODIMENT OF THE INVENTION

    One embodiment of the invention will now be described by way of example with reference to the accompanying drawings.

    It is assumed in this example that a multi-dimensional data warehouse is to be loaded from a number of individual source databases. In a retail environment, for example, the source databases may hold information about point-of-sale (POS) transactions, products, promotions, customers and stores.

    Referring to FIG. 1, the source databases and the data warehouse, are defined as a set of entity-relationship data models 10. These models may be produced using a commercially available modelling tool, such as ERWIN. The models include definitions of the various tables in the databases and the fields (columns) each table. The models also include additional annotations, to assist in the code generation process. These annotations define properties of the entity (table or column), such as:
    • the type of a table-{fact, summary, dimensional}
    • the partitioning strategy for a table-{none,
    • the external data type of a column.


  • A source file 11 defines the way in which the data warehouse is to be loaded from the individual source databases, and other operations to be performed on the data, such as cleansing and de-duplicating tables. The source file 11 is written in a combination of perl and SQL, and includes a number of special high-level commands, referred to as directives.

    At system build time, a pre-processor program 12 processes the source file 11, as will be described, so as to generate a destination file 13, containing code for processing the data warehouse. The pre-processor program uses information pulled from the data models 10. It also uses a set of include files 14 and template files 15, as will be described. The include files and the template files may be written in a combination of perl and SQL, or possibly any other programming language, and may also include directives.

    Directives

    The name of each directive is prefixed by the symbol # to identify it as such to the pre-processor. By way of example, the directives may include the following.

    #aggregate-table This directive instructs the pre-processor to generate SQL code to aggregate data from a specified source table into a specified destination table.

    #append-table This directive instructs the pre-processor to generate SQL code to append rows from a specified source table to a specified destination table.

    #cleanse-table This directive instructs the pre-processor to generate SQL code to perform cleansing of a specified source table, including supply of default values, deletion of irrelevant columns and addition of new columns. The result of the processing is placed in a specified destination table.

    #create-table This directive instructs the pre-processor to generate SQL code to create a table with a specified name. Optionally, it may specify that the table should have the same database schema as an existing table.

    #deduplicate-table This directive instructs the pre-processor to generate SQL code to deduplicate a specified source table (i.e. to remove records that share the same primary key, so that there remains only one record with any given primary key). The result of the processing is placed in a specified destination table.

    #define This directive defines a macro, i.e. a text substitution which is used throughout the rest of the pre-processing. For example, the directive "#define INTSIZE 4" indicates that the value "4" is to be substituted for "INTSIZE". Macros may contain directives and nested macro calls. A macro, once declared, never goes out of scope.

    #include This directive instructs the pre-processor to include code from a specified one of the include files 14.

    #prefix This directive instructs the pre-processor to modify any word beginning with a specified prefix, by replacing the prefix with a specified new prefix.

    Prefixes allow SQL variants in naming standards, such as standards for naming temporary variables, to be accommodated. For example, a prefix "TT-" may be defined as "#" for a first database, and as " " (i.e. nothing) for a second database. All occurrences of "TT--name>" or "-name>", depending on the target database.

    #template This directive instructs the pre-processor to include code from a specified one of the template files 15. Templates allow SQL variants that use source constructs in different orders to be accommodated. The #template directive supplies the sets of text to be inserted, and the template file itself supplies the keywords and the variant-specific ordering.

    For example, the ordering of the clauses within an SQL INSERT statement to insert records into a table varies between databases. The #template directive allows the developer to specify an INSERT statement independently of the target database, by means of the following code:
    • #template