Method and apparatus for populating multiple data marts in a single aggregation process6850952Abstract A method of populating multiple data marts in a single operation from a set of transactional data held in a database in a single aggregation process, in which aggregate values are calculated only once, a determination is made as to which output data marts required the aggregate value, and the aggregate values are output to the appropriate data marts. Dimension data associated with the output aggregate records is also output to the appropriate data marts. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
// martDefList, measureDefList, and measureBitVectorList
// are the data mart definition list, measure definition
// list, and list of measure bit vectors respectively that
// are maintained in the data mart structure.
InitializeMeasuresInfo () {
// initialize the list
measureDefList.clear ();
// first find the unique set of measure definitions
for (i = 0; i < martDefList.size (); i++) {
vColumns =
martDefList[i].factTableDef.GetColumns ();
// loop through each column in target table
for (j = 0; j < vColumns.size (); j ++) {
if (vColumns[j].IsMeasure ()) {
// check if the measure def is already in the
// current measure definition list.
foundIndex = findMeasureDef (measureDefList,
vColumns[j].GetMeasureDef ());
// if new measure
if (foundIndex == -1)
measureDefList.AppendEntry
(vColumns[j].GetMeasureDef ());
}
}
}
// add measure bit vector to the list; one per data mart
for (i = 0; i < martDefList.size (); i++) {
vColumns =
martDefList[i].factTableDef.GetColumns ();
// initialize measure bit vector
bitVector = 0;
// loop through each column in target table
for (j = 0; j < vColumns.size (); j ++) {
if (vColumns[j].IsMeasure ()) {
// find the measure definition from the list;
// the measure definition should be found
foundIndex = findMeasureDef (measureDefList,
vColumns[j].GetMeasureDef ());
measureBit = 1 << foundIndex;
bitVector = bitVector .vertline. measureBit;
}
}
// add the bit vector of required measures to the
// list
measureBitVectorList.AppendEntry (bitVector);
}
}
Once the data mart structure is constructed, it is available for other system components to query user-specified parameters. The following list contains all the access methods required for the data distribution technique. 1) Get All Measure Definitions vMeasuresDef=GetMeasureDefinitions ( ) const Get a constant reference to the measure definition list. The vector contains all distinct measures required by all data marts. Each measure definition contains information of the aggregation type and its input measure column(s). numMeasures=GetNumMeasures ( ) conts Get the total no. of measures in the data mart structure. It is the same as the size of the measure definition list. 2) Get Measure Definition Index index=GetMeasureIndex (const measureDefinition& def) const Find the specified measure definition in the measure definition list of the data mart structure. It returns a 0-based index of the measure found in the list. If the specified measure definition is not found, the method returns -1. 3) Get All Dimension Definitions vDimensionsDef=GetDimensionsDefinitions ( ) const Get a constant reference to the dimension definition list. The vector contains all dimension definitions. Each dimension definition contains the dimension output filtering flag and level codes and conditions. numDimensions=GetNumDimensions ( ) const Get the total no. of dimensions in the data mart structure. It is the same as the size of the dimension definition list. 3) Get Dimension Definition Index index=GetDimensionIndex (const dimensionDef& def) const Find the specified dimension definition in the dimension definition list of the data mart structure. It returns a 0-based index of the dimension found in the list. If the specified dimension definition is not found, the method returns -1. 4) Get All Data Mart Definitions vDataMartsDef=GetDataMartDefinitions ( ) const Get a constant reference to the data mart definition list. The vector contains the definition of all data marts. Each data mart definition contains the dimension and fact table definitions as described in section 1.1.4 5) Get Active Data Marts for a Cross Product vMartBits=GetActiveMarts (const crossProductLevelCode& xprod) const Given a cross product level combination, it determines which data marts will need to generate the specified cross product in its fact output. It basically searches the cross product table and finds the specified cross product. The returning bit vector is indexed the same way as the data mart definition list. A bit set to 1 indicates the corresponding data mart needs to generate/output the specified cross product. 6) Get Active Measures for a Data Mart vMeasBits=GetActiveMeasures (const dataMartDefinition& rMart) const vMeasBits=GetActiveMeasures (int iMart) const Given a data mart definition or a data mart index, it obtains the measure bit vector for the specified data mart. The returning bit vector is indexed the same way as the measure definition list. A bit set to 1 indicates the corresponding measure is required for the data mart. 7) Get Active Data Marts for a Level Code vMartBits=GetActiveMarts (const dimensionDefinition& rDim, const levelCode& code )const vMartBits=GetActiveMarts (int iDim, const levelCode& code )const Get the data marts that need to output dimensional records of the specified level in the specified dimension. This function basically loops through the cross product table and matches the specified level code of the specified dimension in each cross product combination. It then bitwise ORs the data mart bit vector from each found entry. The returned bit vector is indexed the same way as the data mart definition list. A bit set to 1 indicates the corresponding data mart needs to generate/output,the input level cross product. 8) Get Active Level Codes Given a Dimension and a Data Mart vLevelCodes=GetActiveLevels (const dimensionDefinition& rDim, const dataMartDefintion& rMart) const vLevelCodes=GetActiveLevels (int iDim, int iMart) const Given a dimension and a data mart definition or their corresponding indexes, these methods obtain a list of distinct level codes from the specified dimension where the levels are in the list of cross products of the specified data mart. For a particular data mart, whether a dimension record is outputted to a target table depends on whether its level has participated in any cross product combinations of its fact table and the value of the dimension output filtering flag. The "Fact writer" is a functional component responsible for generating the fact tables for all data marts. It takes the list of fact aggregates and its record definition as input, queries the data mart structure for other user-specified parameters, and outputs a fact table for each data mart. Note that a fact table may correspond to more than one data mart if multiple data marts have specified to merge its records to the same fact table. As mentioned previously, each fact aggregate record contains a list of keys, level codes, and measures. However, the keys, level codes, and measures are not required to be in the same order as the dimension definition list and measure definition list in the data mart structure. Their orders are specified in the fact aggregate record definition. Thus, the "fact writer" will need to setup a mapping scheme to re-position the columns from the aggregates record based on the dimension and measure order in the data mart structure. Internally, the fact writer will need to setup a key position vector, a level code position vector, and a measure position vector. The size of the key and level code position vector is the same as the number of dimensions. The size of the measure position vector is the same as the size of the measure definition list in the data mart structure. These vectors basically contain the column positions of the keys, level codes, and measures from the fact aggregate records. They are used for re-mapping the columns so that the fact writer could process the aggregate records in the same dimension or measure order as specified in the data mart structure. The following pseudo-code depicts the major steps of how the different position vectors are set up. The input to the function described therein is the data mart structure.
// After execution of the function, the keyPosVect,
// xProdPosVect, and measurePosVect vectors are setup for
// remapping columns from fact aggregate records. Each item
// in the vectors will contain the corresponding column
// position from the fact aggregate record.
ConstructRemappingVectors ()
{
// initialize size of position vectors.
keyPosVect.resize
(dataMartStructure.GetNumDimensions ());
xProdPosVect.resize
(dataMartStructure.GetNumDimensions ());
measurePosVect.resize
(dataMartStructure.GetNumMeasures ());
// loop through each field definition of the fact
// aggregate record
for (i = 0; i < factAggrRecordDef.size (); i++) {
if (factAggrRecordDef[i].IsKey ()) {
dimDef =
factAggrRecordDef[i].GetDimensionDef ();
// find the index of dimension definition from
// the dimension definition list
foundIndex =
dataMartStructure.GetDimensionIndex (dimDef);
keyPosVect[foundIndex] = i;
}
else
if (factAggrRecordDef[i].IsLevelCode ()) {
dimDef =
factAggrRecordDef[i].GetDimensionDef ();
// find the index of dimension definition from
// the dimension definition list
foundIndex =
dataMartStructure.GetDimensionIndex (dimDef);
xProdPosVect[foundIndex] = i;
}
else { // measure columns
measureDef =
factAggrRecordDef[i].GetMeasureDef ();
// find the index of measure definition from the
// measure definition list in data mart structure
foundIndex =
dataMartStructure.GetMeasureIndex
(measureDef);
// ignore measure that is not active in any data
// marts
if (foundIndex != -1)
measurePosVect[foundIndex] = i;
}
}
}
The "fact writer" also needs to analyze the fact table definition specified in each data mart definition. It needs to maintain a list of distinct target table definitions required by all data marts. Associated with each target table definition is a data mart bit vector, a boolean vector, and a position vector for re-positioning the keys and measures to the position desirable by the target table. The information needed for each distinct fact target table is listed in the structure represented in the following pseudo-code:
factTableInfo {
factTableDef; // fact target table definition
// from data mart structure.
dataMartBitVector; // list of data marts merged
// into this target table.
isKeyVector; // boolean vector to indicate
// key or measure.
positionVector; // integer vector to indicate
// mapping position.
};
The following pseudo-code indicates how to construct the above described information:
// A fact table information list will be initialized with
// distinct target table information.
InitializeFactTablesInfo ()
{
// clear the list of target table info stored in fact
// writer
factTableInfoList.clear ();
// obtain data mart info from data mart structure.
martDefList =
dataMartStructure.GetDataMartDefinitions ();
// for each data mart
for (i = 0; i < martDefList.size (); i ++) {
// check if fact table of current data mart has
// already appeared in the factTableInfoList
foundIndex = findTableDefintion
(martDefList[i].factTableDef,
factTableInfoList);
martBit = 1 << i;
if (foundIndex >= 0) { // if found
// mark the current data mart also active in the
// same target table.
factTableInfoList[foundIndex].dataMartBitVector =
factTableInfoList[foundIndex].dataMartBitVector .vertline.
martBit;
}
else {
// construct a new target table entry
factTableInfo.factTableDef =
martDefList[i].factTableDef;
factTableInfo.dataMartBitVector = martBit;
// get the list of column definitions from
// the table
vColumns =
martDefList[i].factTableDef.GetColumnDefs ();
// set the size of key vector to no. of columns
factTableInfo.isKeyVector.resize
(vColumns.size ());
factTableInfo.positionVector.resize
(vColumns.size ());
// loop through each column in target table
for (j = 0; j < vColumns.size (); j ++) {
if (vColumns[j].IsKey ()) {
factTableInfo.isKeyVector[j] = true;
// find index of dimDef in data mart
// structure
dimIndex =
dataMartStructure.GetDimensionIndex
(vColumns[j].dimensionDef);
factTableInfo.positionVector[j] =
dimIndex;
}
else { // if column is a measure
factTableInfo.isKeyVector[j] = false;
// find index of vColumns[j].measureDef
// in the data mart structure
measureIndex =
dataMartStructure.GetMeasureIndex
(vColumns[j].measureDef);
factTableInfo.positionVector[j] =
measureIndex;
}
}
factTableInfoList.AppendEntry (factTableInfo);
} // else
} // for
}
This pseudo-code populates a list of distinct fact target table information inside the fact writer based on the information from the data mart structure. It loops through each data mart and inspects its fact target table definition. First, it finds if the target table definition is already in the current target table list. If found, that means another data mart is already outputting its results to the same fact target table. Thus, the bit of the current data mart will be set into the data mart bit vector of the found target table information entry. If the data mart is outputting to a new fact target table, a new target table information entry will be added to the list. The bit of the current data mart will be set in the empty data mart bit vector. Also, a boolean and a position vectors will be populated for re-positioning target column purposes. The size of the boolean and position vectors are the same as the no. of columns in the target table. Each entry in the boolean vector indicates whether the output column is a key or a measure. If the column is a key, the corresponding value in the position vector would indicate the dimension index of where the key is from. If the column is a measure, the corresponding value in the position vector would indicate the measure index from the measure definition list in the data mart structure. These indexes are used to re-position the columns when the "fact writer" is given a list of keys and measures in the same order as the dimension and measure definition lists respectively. FIG. 4 describes an example of how the overall mapping mechanism works. After the internal data structure of the fact writer is initialized, the "fact writer" can start reading the list of fact aggregates and distribute the aggregates to different target fact tables. The "fact writer" will only need to read the list of aggregates once. For each record, it would re-map and distribute the mapped record(s) to each participating target table. The following pseudo-code function describes the process of multiple target table distribution given one aggregate record:
WriteAggregateRecord (aggrRecord)
{
// fill up pre-allocated (or data members) keys, xprod,
// and measures lists based on re-mapping columns from
// aggrRecord.
for (i = 0; i < dataMartStructure.GetNumDimensions ();
i ++) {
keys[i] = aggrRecord[keyPosVect[i]];
xProd[i] = aggrRecord[xProdPosVect[i]];
}
for (i = 0; i < dataMartStructure.GetNumMeasures ();
i ++)
measures [i] = aggrRecord[measurePosVect[i]];
// get active data marts for the current cross product
activeMarts = dataMartStructure.GetActiveMarts (xProd);
// loop through each distinct target table
for (i = 0; i < factTableInfoList.size (); i++) {
// if the record is active in the current
// target table
if (activeMarts &
factTableInfoList[i].dataMartBitVector) {
// loop through each column
for (j = 0; j <
factTableInfoList[i].isKeyVector.size ();
j ++) {
position =
factTableInfoList[i].positionVector[j];
// re-position key and measure columns in the
// pre-allocated outputRecord
if (factTableInfoList[i].isKeyVector[j] ==
true)
outputRecord[j] = keys[position];
else
outputRecord[j] = measures[position];
}
// output record to the target table
writeRecordToTable (outputRecord);
}
}
}
There should be an outer loop to read each aggregate record and call this function to write the record to each participating target table. Thus, the above function is executed as many times as the number of records in the input list of aggregate records. The function first re-positions the input aggregate records into three lists: key, cross product, and measure list. The mapped key and cross product lists should have the same order as the dimension definition list in the data mart structure. The mapped measure list should have the same order as the measure definition list. Then, the function determines the active data marts for the cross product of the current aggregate record. For each target table that participates in at least one of the active data marts, the key and measure list will be re-mapped to the desirable column position of the target table. Note that the first step of re-positioning the columns from the input aggregate records is unnecessary if the record is pre-ordered as expected by the data mart structure. This is achievable by utilizing the data mart structure during aggregation generation as explained at the end of this document. Since it is not a requirement for the data mart population technique, a more generic approach has been adopted and this assumption is not put as a pre-condition. "Dimension writer" is a functional component responsible for generating dimension target tables of one dimension for all data marts. It takes a list of dimensional records as input, queries the data mart structure for user-specified parameters, and outputs a dimension table for each data mart. Note that a dimension table may correspond to more than one data mart if multiple data marts have specified to merge its records to the same dimension table. The "dimension writer" needs to set up some internal information before it can start data distribution. First, it needs to find the key, level code, and ever-active switch position based on its input dimension record definition. This could be performed by looping through each field in the dimension record definition as depicted in the pseudo-code shown below. It saves the found key, level code, and active switch position into an internal variable of the dimension writer:
// dimKeyPos - key position from input dimension record.
// dimCodePos - level code position from input dimension
// record.
// dimSwitchPos - active switch position from input
// dimension record.
GetDimPositions ()
{
dimKeyPos = -1;
dimCodePos = -1;
dimSwitchPos = -1;
for (int i = 0 ; i < dimRecordDef.size (); i++) {
if (dimRecordDef[i].IsKey ())
dimKeyPos = i;
else if (dimRecordDef[i].IsLevelCode ())
dimCodePos = i;
else if (dimRecordDef[i].IsActiveSwitch ())
dimSwitchPos = i;
}
}
Before doing multiple dimension tables distribution, a pre-processing step is required to obtain an active data mart bit vector for each dimensional record. The bit vector is used to aid the output filtering option described with respect to the input requirements. Thus, if output filtering is not implemented, this pre-processing step could also be eliminated. The output of the pre-processing step is a list of data mart bit vectors for each dimension where each data mart bit vector corresponds to one dimensional record. The pre-processing step could populate the lists of data mart bit vectors for all dimensions at the same time by reading the list of fact aggregate records once. Initially, the lists of data mart bit vectors of all dimensions are set to 0. Then, for each fact aggregate record, the key and level code pair are found for each dimension. The data mart bit vector associated with the cross products of the levels for each dimension are found using the "Get active marts for a cross product" method. For each key, the dimensional record is found from the corresponding dimensional record list by matching the key. Finally, the data mart bit vector of the found dimensional record will be ORed with the data mart bit vector just obtained from the data mart structure. The found dimensional record is accordingly considered active in the data marts specified in the data mart bit vector as well as any data marts it was already associated with due to appearing in other cross products. Thus, by reading the list of fact aggregates once, the data mart bit vectors of all dimensions will be populated. The pre-processing step can be optimized to be performed during aggregation generation. However, since the initialization of the data mart bit vector list is part of the preferred data mart population technique, the pre-processing step is divided out to provide a more general approach. The "dimension writer" also needs to query the data mart structure and analyze the dimension table definition information for the current dimension specified in each data mart. Internally, the dimension writer needs to maintain a list of distinct dimension target table definitions required by all data marts. Associated with each dimension table definition is an active data mart bit vector, a list of active level codes, and a position vector for re-positioning the keys and attributes to the position desirable by the target table. The information needed for each dimension target table is listed in the following structure:.
dimTableInfo {
dimTableDef; // dimension target table
// definition from data mart
// structure.
dataMartBitVector; // list of data marts merged
// into this target table.
levelCodeVector // contains list of active
// level codes for the table
positionVector; // integer vector to indicate
// mapping position.
};
The following pseudo-code indicates how to construct the above described information:
// A dimension table information list will be initialized
// with distinct target table information.
InitializeDimTablesInfo ()
{
// clear the list of target table info stored in
// dimension writer
dimTableInfoList.clear ();
martDefList =
dataMartStructure.GetDataMartDefinitions ();
// curDimDef is the current dimension definition
curDimIndex =
dataMartStructure.GetDimensionIndex (curDimDef);
for (i = 0; i < martDefList.size (); i ++) {
// check if dimension table of current data mart has
// already appeared in the dimTableInfoList
foundIndex = findTableDefintion
(martDefList[i].dimTableDef[curDimIndex],
dimTableInfoList);
martBit = 1 << i;
if (foundIndex >= 0) { // if found
// add current data mart to data mart bit vector
dimTableInfoList[foundIndex].dataMartBitVector =
dimTableInfoList[foundIndex].dataMartBitVector .vertline.
martBit;
// merge current list of active level codes
// into the found entry
vCodes = GetActiveLevels (curDimIndex, i);
mergeDistinctLevelCodes
(dimTableInfoList[foundIndex].levelCodeVector,
vCodes);
}
else {
// construct a new target table entry
dimTableInfo.dimTableDef =
martDefList[i].dimTableDef[curDimIndex];
dimTableInfo.dataMartBitVector = martBit;
dimTableInfo.levelCodeVector =
GetActiveLevels (curDimIndex, i);
// get the list of columns from the table
vColumns =
martDefList[i].dimTableDef.GetColumnDefs ();
for (j = 0; j < vColumns.size (); j++) {
if (vColumns[j].IsKey ()) {
// key position from input dimension
// record
dimTableInfo.positionVector[j] =
dimKeyPos;
}
else { // attribute column
// key position from input dimension
// record
dimTableInfo.positionVector[j] =
dimRecordDef.FindInputColumn
(vColumns[j].GetInputColumn());
}
} // for
}
} // for
}
This pseudo-code populates a list of distinct dimension target table information inside the writer based on the information from the data mart structure. It loops through each data mart and inspects its target table definition of the current dimension. First, it finds if the target table definition is already in the target table list. If found, that means another data mart is already outputting its results to the same dimension target table. Thus, the bit of the current data mart will be set into the data mart bit vector of the found table information entry. Also, the list of active level codes for the current data mart will be merged into the level code vector of the found entry. If the data mart is outputting to a new dimension target table, a new target table information entry will be added to the list. The bit of the current data mart will be set in the initial empty data mart bit vector. The level code vector entry will be populated with the list of active level codes for the current data mart. Finally, the position vector will contain indexes or column positions from the dimension record. The vector is used for re-positioning the columns from the dimension records to be outputted into the target table. After the internal data structure of the dimension writer is initialized, it can start reading the list of dimensional records and distribute the records to different target tables. The dimension writer will only need to read the list of dimensional records once. For each record, it would re-map and distribute the mapped record(s) to each participating target table. The following pseudo-code function describes the process of multiple target table distribution given one dimensional record and its corresponding active data mart bit vector.
WriteDimensionRecord (dimRecord, activeMartBitVector)
{
for (i = 0; i < dimTableInfoList.size (); i++) {
activeInTable =
activeMartBitVector &
dimTableInfoList[i].dataMartBitVector;
levelInTable = findLevelCode
(dimTableInfoList[i].activeLevelCodes,
dimRecord[dimCodePos]);
outputRecordFlag = false;
// update the ever active switch
if (activeInTable)
dimRecord[dimSwitchPos] = true;
// determine whether to output the record depending
// on output filtering option
if (filterOption == AllRecords) {
if (levelInTable)
outputRecordFlag = true;
}
else
if (filterOption == ActiveInDataMarts) {
if (levelInTable && activeInTable)
outputRecordFlag = true;
}
else
if (filterOption == EverActiveInDataMarts) {
if (levelInTable &&
dimRecord[dimSwitchPos] == true)
outputRecordFlag = true;
}
if (outputRecordFlag == true) {
for (j = 0; j <
dimTableInfoList[i].positionVector.size ();
j ++) {
position =
dimTableInfoList[i].positionVector[j];
outputRecord[j] = dimRecord[position];
}
writeOutputRecord (outputRecord);
}
} // for
}
The active data mart bit vector is the output of the pre-processing step. There should be an outer loop to read each dimensional record and call this function to write the record to each participating target table. Thus, this function is executed as many times as the no. of records in the input list of dimensional records. The function loops through its list of dimension target table information. It first initializes two flags: activeInTable and levelInTable. activeInTable indicates whether the current dimension record is active in the fact. levelInTable indicates whether the level code of the current dimension record is in the level code list of the current target table. It then updates the persistent ever active switch. The switch is set to true if the activeInTable flag is true. Based on the filtering option, the logic determines whether the current dimension record should be outputted to the target table. If the record is to be outputted, the key and attribute columns will be re-mapped to the desirable column position of the target table. It should be noted that the "Ever Active" flag associated with each input dimension record could be replaced with an "Ever Active in Data Marts" bit vector. This would allow dimension records to be output to target data marts if those data marts have ever received fact data corresponding to the data mart in question. However, this would require maintaining persistent data relating to the output data marts, preferably in the input dimension tables, between aggregations. This causes problems keeping the input fact data synchronized with the output data mart definitions and for this reason isn't implemented in the preferred embodiment described herein. The "Ever Active" flag implemented in the preferred embodiment doesn't contain any data dependent on the configuration of the output data marts. The operation of the system components and the flow therebetween are hereinafter described. FIG. 5 depicts an overview of data flow given the input requirements and the system components described in previous sections. As depicted in the diagram, the data mart structure has to be constructed before distribution can start. Once the data mart structure is initialized, the pre-processing step for dimensional records can take place. The pre-processing step establishes the field positions of keys and level codes from all dimension record definitions and fact aggregate record definitions. After that, reading of the list of fact aggregate records is commenced. For each fact aggregate record, the data mart bit vectors of all the fact aggregate record's corresponding dimension records (one from each dimension) will be updated. Details of the pre-processing step were described earlier with reference to the "Dimension Writer". When pre-processing is complete, the operations of the "fact writer" and the "dimension writers" can proceed in parallel. The "fact writer" takes the list of fact aggregate records as input, queries the data mart structure, and generates a fact table for each data mart. Each "dimension writer" takes a list of dimensional records and its corresponding list of data mart bit vectors generated from the pre-processing step as input, queries the data mart structure, and generates a dimension table for each data mart. Thus, the writers do not have any interdependency and they could perform their tasks independently. However, if the data marts are being loaded onto a database into tables in which referential integrity is enforced, at any particular instant, all key values in the fact data must have corresponding values in the dimension data. If the fact writer writes an entry before all its corresponding key values have been written by the dimension writer, there will be a breakdown in referential integrity. If the writers are being used in such an environment, referential integrity enforcement must be disabled before commencing the process and re-enabled when the process finishes. Alternatively, the fact writer could be set to commence only when the dimension writer has finished processing. The following is a summary of the major benefits and advantages of the data mart population technique of the invention: 1) Allow population of multiple data marts in a single aggregation generation process. 2) Much more resource effective to generate multiple data marts in one shot since data marts can have overlapped aggregates requirements or even same levels of aggregations. It eliminates the overhead of reading input fact file and generating the same aggregation buckets multiple times. 3) Allow data marts to have different star schema configurations and different aggregates. The configuration of each data mart is tailored or targeted to the needs of different user groups. 4) Records in dimension table are closely tied to the aggregated records in the fact table. A record is in the dimension table only if the record level participates in the list of output cross products for the corresponding fact table. The data marts generated are much less error prone. 5) Allow data partitioning for large data sets. Logically separated data marts may merge and share dimension tables to avoid unnecessary duplication of dimensional data while having different fact tables. With the flexible scheme of data mart distribution, data could be partitioned or merged in different dimensions or fact tables. 6) Flexible dimension output filtering scheme allows users to choose the right set of dimension records to be outputted to dimension tables. The technique provided in this document is intended to cover a more general scenario. Additional and reasonable assumptions could be applied to the technique for optimizations. This section describes a few variations of possible optimizations. As mentioned before, the dimension output filtering flag is not a requirement for the data mart population technique of the invention. It is provided to enhance the capability of dimension records distribution. The logic has been provided to enable four output filtering options which are: no dimension output, all dimension records, active in data mart, and ever active in data marts. Additional output filtering option could be added based on other combinations of active data mart or level code criteria. The "ever active in data mart" output filtering capability requires dimension records ever seen to be kept persistent because each dimension record needs to maintain an "ever active" flag over time. If the "ever active in data mart" filtering capability is not needed to be supported, then dimension records are also not required to be persistent. In the described technique, no assumptions were made about the measures in a fact aggregate record except that it should be a superset of all measures required. Also, no assumptions were made about the field position of the fact aggregate records and the dimensional records. In the preferred implementation, some reasonable assumptions were made because the data mart structure is constructed before aggregation generation. It is fairly common that users would have all the dimension and data mart definitions before aggregation generation. Given the data mart structure before aggregation generation, the set of required measures is known. The order of dimensions and measures in the data mart structure is also known. In this case, only the required measures are generated during aggregation generation. The keys, level codes, and measures listed in each fact aggregate record could also be in some pre-assumed order. For instance, the first N fields in the record would be key fields in the same order of dimensions as specified in the data mart structure, the next N fields would be level code fields also in the same order of dimensions as specified in the data mart structure, the last M fields would be the measures in the same order as specified in the data mart structure. In this case, the initial re-positioning step from the fact aggregate record could be eliminated. For dimensional records, the key, level code and ever active switch could be assumed to be placed in the first three fields for optimization purposes. If the data mart structure is constructed before aggregation generation, the pre-processing step for populating data mart bit vectors could also be eliminated. The task could be achieved during the same time as aggregation generation. Then there is no need to read the list of fact aggregates once just for the purpose of populating data mart bit vectors. In the preferred implementation, when an input fact record gets aggregated into all its participating aggregation buckets, the data mart bit vectors of all dimensional records related to the input fact record and its aggregation buckets will be updated. The logic in updating the bit vectors is the same as described in the "Dimension Writer" section. However, in this case, it eliminates the need of reading the fact aggregate records one more time. Again, the data mart bit vectors in each dimension are used for output filtering purpose. If no output filtering is supported or the filtering options supported do not require the data mart bit vectors, then the step for populating the data mart bit vectors could be eliminated. While the preferred embodiment of the invention has been shown and described, it will be apparent to those skilled in theart that changes and modifications may be made therein without departing from the spirit of the invention, the scope of which is defined by the appended claims.
|
Same subclass Same class Consider this |
||||||||||
