|
|
|
Query formulation, input preparation, or translation |
Association rule generation and group-by processing system6226634
Abstract
A group-by processing system performs a specified operation computing average value, etc. on a group of records having the same key value, efficiently accesses a secondary storage device, and realizes a high speed process. The group-by processing system includes a unit for storing records; a unit for storing pointers to the records in the unit for storing records at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record; a unit for outputting the records pointed to by the pointers in the unit for storing pointers to secondary storage device, given the hash function values; and a unit for reading a list of output hashed records, sorting the records by key value, and performing a group-by operation on the sorted records. The system also includes a combination generation unit for outputting a combination of two or more items satisfying a combination generation restriction condition; an occurrence count unit for counting occurrences of the combinations; a combination selection unit for selecting the combination which satisfies the given condition of the number of occurrences; and a restriction condition generation unit for assigning the restriction condition to the combination generation unit according to the selection result. The combinations of items can be counted efficiently by performing a high-speed group-by operation.
Claims
What is claimed is:
1. A group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers; and
group-by operation execution means for reading a list of hashed records output to said storage device by said output means, sorting the hashed records in the list according to the key value, and performing the group-by operation on the list of the sorted records.
2. The system according to claim 1, wherein said output means outputs the records output corresponding to the hash function values from said record storing means to said storage device as a set of blocks, and comprises:
auxiliary information storing means for storing auxiliary information for retrieving records in the block according to the hash function value.
3. The system according to claim 2, wherein said group-by operation execution means performs the group-by operation on the records using a sequence of hashed records corresponding to the hash function values and output as a set of blocks by said output means and auxiliary information stored in said auxiliary information storing means.
4. A hash system in a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the record; and
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device, given the hash function values for storage positions of the pointers.
5. The system according to claim 4, wherein said output means outputs the records output corresponding to the hash function values from said record storing means to said storage device as a set of blocks, and comprises:
auxiliary information storing means for storing auxiliary information for retrieving records in the block according to the hash function value.
6. A group-by processing system for performing a group-by operation, comprising: means for transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records corresponding to the hash function values stored in sequential space or non-sequential space in said storage device and auxiliary information for retrieving the records in the block according to the hash function value.
7. A group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device as a set of blocks, each of which is output to sequential space, given the hash function values for storage positions of the pointers;
auxiliary information storing means for storing auxiliary information for retrieving records in the block according to the hash function value; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means corresponding to the hash function values and auxiliary information stored in said auxiliary information storing means.
8. A hash system in a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers; and
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value.
9. The hash system according to claim 8, wherein said output means outputs the records contained in the same run to sequential space in said storage device, and stores the records of different runs to sequential space or non-sequential space in said storage device.
10. A group-by processing system for performing a group-by operation, comprising: means for transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records obtained as a result of the hash process in said storage device and run information, obtained as a result of the same hash process, including addresses, each of which points to a record contained in each run created by output means in said storage device, given the hash function values from a minimum value to a maximum value.
11. A group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers; run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value; and
group-by operation execution means for performing the group-by process using a sequence of hashed records output by said output means and stored in said storage device and the run information stored in said run information storing means.
12. A group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the record;
output means for outputting the records pointed to by pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers, outputting the records contained in the same run to sequential space in said storage device, given the hash function values from a minimum value to a maximum value, and outputting the records of different runs to sequential space or non-sequential space in said storage device;
run information storing means for storing run information including addresses, each of which points to a record contained in each run in said storage device; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means and stored in said storage device and the run information stored in said run information storing means.
13. A hash system in a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device as a set of blocks, and outputting the blocks with connection data between two blocks if the two blocks should be adjacent with each other and if the blocks cannot be output to sequential space in said storage device, given the hash function values for storage positions of the pointers; and
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value.
14. A group-by processing system for performing a group-by operation, comprising; means for transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records stored in said storage device, run information including addresses, each of which points to a record contained in each run created by output means in said storage device, given the hash function values from a minimum value to a maximum value, and connection data between two blocks which are stored in independent and non-sequential space in said storage device though the blocks should come sequentially.
15. A group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, comprising:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to a hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device, and outputting the blocks with connection data between two blocks if the two blocks should be adjacent with each other and if the blocks cannot be output to sequential space in said storage device, given the hash function values for storage positions of the pointers;
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means and stored in said storage device, given the hash function values, the run information stored in said run information storing means, and the connection data between the two blocks.
16. A computer readable storage medium storing a process, for a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, the process comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
outputting the records pointed to by the pointers to said storage device, given the hash function values for storage positions of the pointers; and
reading a list of hashed records output to said storage device, sorting the hashed records in the list according to the key value, and performing the group-by operation on the list of the sorted records.
17. A computer readable storage medium storing a process, for a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, the process comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
outputting the records pointed to by the pointers to sequential space or non-sequential space in said storage device as a set of blocks, each of which is output to sequential space in said storage device;
storing auxiliary information for retrieving records in the block according to the hash function value; and
performing the group-by operation using a sequence of hashed records output to said storage device and the auxiliary information.
18. A computer readable storage medium storing a process, for a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, the process comprises the steps of:
temporarily storing the records;
storing pointers to temporarily stored records at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
outputting the records pointed to by the pointers in said storage device, given the hash function values for storage positions of the pointers;
storing run information including addresses, each of which points to a record contained in each run created in said storage device, given the hash function values from a minimum value to a maximum value; and
performing the group-by operation using a sequence of hashed records stored in said storage device and the run information stored in said run information storing means.
19. A computer readable storage medium storing a process, for a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, the process comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to a hash function value calculated using the key value of the pointed record;
outputting the records pointed to by the pointers to said storage device, given the hash function values for storage positions of the pointers, outputting the records contained in the same run to sequential space in said storage device, given the hash function values from a minimum value to a maximum value, and outputting the records of different runs to sequential space or non-sequential space in said storage device;
storing run information including addresses, each of which points to a record contained in each run in said storage device; and
performing the group-by process using a sequence of hashed records output and stored in said storage device and the run information.
20. A computer readable storage medium storing a process, for a group-by processing system for performing a group-by operation based on a hash method of transforming records to be hashed stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record, the process comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to a hash function value calculated using the key value of the record;
outputting the records pointed to by the pointers to said storage device as a set of blocks, each of which is output to sequential space in said storage device, given the hash function values for storage positions of the pointers, and for outputting the blocks with connection data between two blocks if the two blocks should be adjacent with each other and if the blocks cannot be output to sequential space in said storage device;
storing run information including addresses, each of which points to a record contained in each run created in said storage device, given the hash function values from a minimum value to a maximum value; and
performing the group-by process using a sequence of hashed records stored in said storage device, the run information, and the connection data between the two blocks.
21. A related data combination count system for obtaining individual items or combinations of two or more items from a plurality of transactions containing one or more items, and determining the number of occurrences of the individual items or the combinations of two or more items which satisfy a given condition of the number of occurrences in the transactions, comprising:
combination generation means for outputting each item in each transaction when the individual item is to be obtained, and for generating every combination of the number of items to be obtained from each transaction and outputting only the combination satisfying a combination generation restriction condition that partial combinations in the combination satisfy the given condition of the number of occurrences, said partial combination comprising the number of items less than the number of items to be obtained, when the combination of two or more items is to be obtained;
occurrence count means for counting occurrences, in all transactions, of the individual item or the combination of two or more items output by said combination generation means;
combination selection means for selecting the individual item or the combination of two or more items which satisfies the given condition of the number of occurrences; and
restriction condition generation means for assigning the combination generation restriction condition to said combination generation means according to the individual item or the combination of two or more items selected by said combination selection means.
22. The system according to claim 21, wherein said restriction condition generation means set a combination generation restriction condition by generating a bit map with specified bit value at bit positions, each of which corresponds to the individual item or the combination of two or more items selected by said combination selection means; and
said combination generation means generates only the combination of items, in which partial combinations, each of which comprises the number of items less than the number of items to be obtained, correspond to one of the individual items or the combinations of items associated with the bit positions containing the specified bit value in the bit map when said combination generation means generates the combination of two or more items.
23. The system according to claim 22, wherein said restriction condition generation means sets the specified bit value at the bit position in the bit map corresponding to a hash function value for the individual item or the combination of two or more items selected by said combination selection means.
24. The system according to claim 21, wherein said combination generation means outputs, in addition to the items contained in the transactions, a combination of items or an individual item contained in another level in a hierarchical structure of items, excepting ancestor items in the hierarchical structure of items.
25. The system according to claim 21, wherein said combination generation means generates a combination of items from a sequence of items in time-series contained in the transactions with keeping the order of the sequence.
26. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
record storing means for temporarily storing the records;
pointer storing means pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the pointed record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers; and
group-by operation execution means for reading a list of hashed records output to said storage device by said output means, sorting the hashed records in the list according to the key value, and performing the group-by operation on the list of the sorted records.
27. The system according to claim 26, wherein said output means outputs the records output corresponding to the hash function values from said record storing means to said storage device as a set of blocks, and comprises:
auxiliary information storing means for storing auxiliary information for retrieving records in the block according to the hash function value.
28. The system according to claim 27, wherein said group-by operation execution means performs the group-by process on the records using a sequence of hashed records corresponding to the hash function values and output as a set of blocks by said output means and using auxiliary information stored in said auxiliary information storing means.
29. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises in a hash processing system:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the record; and
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device, given the hash function values for storage positions of the pointers.
30. The system according to claim 29, wherein said output means outputs the records output corresponding to the hash function values from said record storing means to said storage device as a set of blocks, and comprises:
auxiliary information storing means for storing auxiliary information for retrieving records in the block according to the hash function value.
31. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
group-by operation execution means for performing the group-by operation using a sequence of hashed records corresponding to the hash function values stored in sequential space or non-sequential space in said storage device and auxiliary information for retrieving the records in the block according to the hash function value.
32. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the record; and
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device as a set of blocks, each of which is output to sequential space, given the hash function values for storage positions of the pointers;
auxiliary information storing means for storing auxiliary information for retrieving records in the block according to the hash function value; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means corresponding to the hash function values and the auxiliary information stored in said auxiliary information storing means.
33. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises in a hash processing system:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the pointed record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers; and
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value.
34. The hash system according to claim 33, wherein said output means outputs the records contained in the same run to sequential space in said storage device, and stores the records of different runs to sequential space or non-sequential space in said storage device.
35. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
group-by operation execution means for performing the group-by operation using a sequence of hashed records obtained as a result of the same hash process in said storage device and run information, obtained as a result of the same hash process, including addresses, each of which points to a record contained in each run created by output means in said storage device, given the hash function values from a minimum value to a maximum value.
36. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers;
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means and stored in said storage device and the run information stored in said run information storing means.
37. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to said storage device, given the hash function values for storage positions of the pointers, outputting the records contained in the same run to sequential space in said storage device, given the hash function values from a minimum value to a maximum value, and outputting the records of different runs to sequential space or non-sequential space in said storage device;
run information storing means for storing run information including addresses, each of which points to a record contained in each run in said storage device; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means and stored in said storage device and the run information stored in said run information storing means.
38. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises in a hash processing system:
record storing means for temporarily storing the records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device as a set of blocks, and outputting the blocks with connection data between two blocks if the two blocks should be adjacent with each other and if the blocks cannot be output to sequential space in said storage device, given the hash function values for storage positions of the pointers; and
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value.
39. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
group-by operation execution means for performing the group-by operation using a sequence of hashed records stored in said storage device, run information including addresses, each of which points to a record contained in each run created by output means in said storage device, given the hash function values from a minimum value to a maximum value, and connection data between two blocks which are stored in independent and non-sequential space in said storage device though the blocks should come sequentially.
40. The system according to claim 21, wherein said occurrence count means performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said system further comprises:
record storing means for temporarily storing records;
pointer storing means for storing pointers to the records in said record storing means at positions, each of which corresponds to the hash function value calculated using the key value of the record;
output means for outputting the records pointed to by the pointers stored in said pointer storing means to sequential space or non-sequential space in said storage device, and outputting the blocks with connection data between two blocks if the two blocks should be adjacent with each other and if the blocks cannot be output to sequential space in said storage device, given the hash function values for storage positions of the pointers;
run information storing means for storing run information including addresses, each of which points to a record contained in each run created by said output means in said storage device, given the hash function values from a minimum value to a maximum value; and
group-by operation execution means for performing the group-by operation using a sequence of hashed records output by said output means and stored in said storage device, given the hash function values, the run information stored in said run information storing means, and the connection data between the two blocks.
41. A computer readable storage medium storing a process, in a related data combination count system for obtaining individual items or combinations of two or more items from a plurality of transactions containing one or more items, and determining the number of occurrences of the individual item or the combination of two or more items which satisfy a given condition of the number of occurrences in the transactions, the process comprises the steps of:
outputting each item in each transaction when the individual item is to be obtained, and generating every combination of the number of items to be obtained from each transaction and outputting only the combination satisfying a combination generation restriction condition that partial combinations in the combination satisfy the given condition of the number of occurrences, said partial combination comprising the number of items less than the number of items to be obtained, when the combination of two or more items is to be obtained;
counting occurrences, in all transactions, of the individual item or the combination of two or more items output by said outputting function;
selecting the individual item or the combination of two or more items which satisfies the given condition of the number of occurrences; and
assigning the combination generation restriction condition to said outputting function according to the individual item or the combination of two or more items selected by said selecting function.
42. The storage medium according to claim 41, wherein said step of counting occurrences performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said process further comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to the hash function value calculated using the key value of the pointed record;
outputting the records pointed to by the pointers to said storage device, given the hash function values for storage positions of the pointers; and
reading a list of hashed records output to said storage device, sorting the hashed records in the list according to the key value, and performing the group-by operation on the list of the sorted records.
43. The storage medium according to claim 41, wherein said step of counting occurrences performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said process further comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to the hash function value calculated using the key value of the pointed record; and
outputting the records pointed to by the pointers to sequential space or non-sequential space in said storage device as a set of blocks, each of which is output to sequential space in said storage device;
storing auxiliary information for retrieving records in the block according to the hash function value; and
performing the group-by operation using a sequence of hashed records output to said storage device and the auxiliary information.
44. The storage medium according to claim 41, wherein said step of counting occurrences performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said process further comprises the steps of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to the hash function value calculated using the key value of the pointed record;
outputting the records pointed to by the pointers in said storage device, given the hash function values for storage positions of the pointers;
storing run information including addresses, each of which points to a record contained in each run created in said storage device, given the hash function values from a minimum value to a maximum value; and
performing the group-by operation using a sequence of hashed records stored in said storage device and the run information stored in said run information storing means.
45. The storage medium according to claim 41, wherein said step of counting occurrences performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said process further comprises the step of:
temporarily storing the records;
storing pointers to the temporarily stored records at positions, each of which corresponds to the hash function value calculated using the key value of the record;
outputting the records pointed to by the pointers to said storage device, given the hash function values for storage positions of the pointers, outputting the records contained in the same run to sequential space in said storage device, given the hash function values from a minimum value to a maximum value, and outputting the records of different runs to sequential space or non-sequential space in said storage device;
storing run information including addresses, each of which points to a record contained in each run in said storage device; and
performing the group-by operation using a sequence of hashed records output and stored in said storage device and the run information.
46. The storage medium according to claim 41, wherein said step of counting occurrences performs a group-by operation to count the occurrences of the items based on a hash process, which transforms records including the individual items or the combination of two or more items stored in a storage device into a referenceable storage form using hash function values, each of which corresponds to a key value of the record; and
said process further comprises the steps of:
temporarily storing records;
storing pointers to the temporarily stored records at positions, each of which corresponds to the hash function value calculated using the key value of the record;
outputting the records pointed to by the pointers to said storage device as a set of blocks, each of which is output to sequential space in said storage device, given the hash function values for storage positions of the pointers, and for outputting the blocks with connection data between two blocks if the two blocks should be adjacent with each other and if the blocks cannot be output to sequential space in said storage device;
storing run information including addresses, each of which points to a record contained in each run created in said storage device, given the hash function values from a minimum value to a maximum value; and
performing the group-by process using a sequence of hashed records stored in said storage device, the run information, and the connection data between the two blocks.
47. A method of counting related data combinations for obtaining individual items or combinations of two or more items from a plurality of transactions containing one or more items, and determining the number of occurrences of the individual items or the combinations of two or more items which satisfy a given condition of the number of occurrences in the transactions, comprising the steps of:
outputting each item in each transaction when the individual item is to be obtained, and generating every combination of the number of items to be obtained from each transaction and outputting only the combination satisfying a combination generation restriction condition that partial combinations in the combination satisfy the given condition of the number of occurrences, said partial combination comprising the number of items less than the number of items to be obtained, when the combination of two or more items is to be obtained;
counting occurrences, in all transactions, of the individual item or the combination of two or more items output by said outputting step;
selecting the individual item or the combination of two or more items which satisfies the given condition of the number of occurrences; and
assigning the combination generation restriction condition to said outputting step according to the individual item or the combination of two or more items selected by said selecting step.
48. The method according to claim 47, wherein said assigning step further comprises the step of:
creating, for each number of items to be obtained, a bit map with specified bit value at positions corresponding to the individual items or the combinations of two or more items selected by said selecting step, and
said outputting step further comprises the step of:
using at least one bit map for the number of items less than the number of items to be obtained, and determining that the partial combination in the combination corresponds to the individual items or combinations of two or more items associated with the positions with specified bit value in the bit map.
49. The method according to claim 48, wherein each position in the bit map corresponds to a hash function value, calculated using identification numbers for individual items or for the combinations of two or more items.
50. The method according to claim 47, wherein, in addition to the items contained in the transactions, individual items or combinations of two or more items contained in another level in a hierarchical structure of items, excepting ancestor items in the hierarchical structure of items, and the number of occurrences of the individual items or the combinations of two or more items are obtained.
51. The method according to claim 47, wherein individual items or combinations of two or more items satisfying the given condition and the number of occurrences of the individual items or the combinations of two or more items are obtained from a sequence of items in time-series contained in the transactions with keeping the original order of the sequence.
52. A method of counting related data combinations for obtaining individual items or combinations of two or more items from a plurality of transactions containing one or more items, and determining the number of occurrences of the individual items or the combinations of two or more items which satisfy a given condition of the number of occurrences in the transactions, comprising the steps of:
(a) counting individual items in each transaction, and obtaining the number of occurrences of each individual item in all transactions;
(b) selecting individual items having the number of occurrences which meets the given condition, and outputting the individual items and the number of occurrences;
(c) generating a bit map for the individual items with specified bit value set at bit positions, each of which corresponds to one of the output items;
(d) generating combinations of two items, from transactions of two or more items, including only items corresponding to the bit position having the specified bit value in the bit map;
(e) counting the combinations of two items in the transactions of two or more items, and obtaining the number of occurrences of each combination of two items;
(f) selecting the combinations of two items having the number of occurrences which meets the given condition, and outputting the selected combinations of two items and the number of occurrences;
(g) generating a bit map for the combinations of two items with specified bit value set at bit positions, each of which corresponds to one of the selected combination of two items, and
updating the bit map for the individual items with specified bit value set at bit positions, each of which corresponds to one of the two items of the selected combination of two items;
(h) generating combinations of three items, from transactions of three or more items, including only one item or combination of two items corresponding to the bit positions having the specified bit value in the previous generated bit maps;
(i) counting the combinations of three items in the transactions of three or more items, and obtaining the number of occurrences of each combination of three items;
(j) selecting the combinations of three items having the number of occurrences which meets the given condition, and outputting the selected combinations of three items and the number of occurrences;
(k) generating a bit map for the combinations of three items with specified bit value set at bit positions, each of which corresponds to one of the selected combination of three items,
updating the bit map for the combinations of two items with specified bit value set at bit positions, each of which corresponds to one of the three combinations of two items of the selected combination of three items, and
updating the bit map for the individual items with specified bit value set at bit positions, each of which corresponds to one of the three items of the selected combination of three items; and
(l) repeating steps (h) through (k) for combinations of four or more items in the same manner.
53. The method according to claim 52, wherein each position in the bit maps corresponds to a hash function value, calculated using identification numbers for individual items or for the combinations of two or more items.
54. The method according to claim 52, wherein, in addition to the items contained in the transactions, individual items or combinations of two or more items contained in another level in a hierarchical structure of items, excepting ancestor items in the hierarchical structure of items, and the number of occurrences of the individual items or the combinations of two or more items are obtained.
55. The method according to claim 52, wherein individual items or combinations of two or more items satisfying the given condition and the number of occurrences of the individual items or the combinations of two or more items are obtained from a sequence of items in time-series contained in the transactions with keeping the original order of the sequence.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to processing of a large amount of data normally stored in a database. Specifically, it relates first to a group-by process, and second to a database mining process.
The present invention relates to a process of classifying a large amount of records depending on a key value of each record and performing a specified operation on a group of records having the same key value, such as, for example, obtaining an average value. Such a process is referred to as a group-by process. The present invention relates more specifically to a group-by processing system based on a hash process, that is, a system of hashing a large volume of records according to a hash function value obtained by applying an appropriate hash function to a key value, generating a list of the hashed records, sorting the hashed records in the list according to the key value, and performing a group-by process on the sorted records of the resultant list.
The present invention also relates to a database mining process for obtaining a rule of the relationship among data stored in a database, and more specifically to a system of counting occurrences of a combination of related data in a large volume of data in the database. According to a count result of the system, a process of generating an association rule is performed as one of the data mining methods using a combination which meets a given condition and the number of occurrences of the combination. The association analysis based on the association rule is carefully considered in the U.S. and many other nations.
2. Description of the Related Art
Normally, an operation in the group-by process, that is, an operation on a group of records having the same key value, for example, the same item number, can be a count process for counting records such as item sets, an operation for computing a total or average of specific field values of the group of records, etc. These group-by processes are frequently performed in a relational database process, a statistical process, etc.
A group-by processing system can be based on a sort process or a hash process. A system based on the sort process is performed by continuously accessing records having the same key value by sorting a group of records according to the key values. That is, the group of records are first sorted according to the key values, and a list of resultant sorted records is searched from the beginning. A specified operation is repeated for the records having the same key value. When the key value changes, the operation is initialized.
In a hash process, a group of records are read into an input buffer, and hash function values, which respectively correspond to key values of the records to be hashed, are calculated by using a hash function. Then, each of the records in the input buffer is stored in one of record buffers according to the hash function value of the record. If a record buffer, which corresponds to one of the hash function values, is filled with records, the records in the record buffer are output into one of the hashed lists, which respectively correspond to a record buffer, in a secondary storage.
FIG. 164 is a flowchart showing an example of a conventional group-by processing system based on the sort process. When the process starts as shown in FIG. 164, a group of records to be processed in a group-by process is sorted according to key values in step S201. In step S202, the leading record in the group of records is read. In step S203, a function is initialized. In step S204, an operation of the function is performed on the read record. In step S205, it is determined whether or not any record still exists in the group of the sorted records.
When a record exists, the leading record in the group of the sorted records is read in step S206. In step S207, it is determined whether or not the key value of the record is equal to the key value of the previously read record. When the key values are equal to each other, the processes in and after steps S204 are repeated.
When the key value of a record is not equal to that of the previous record, a termination process is performed on the function in step S208. In step S209, the result of the function process and the record are output as a resultant record, and the processes in and after step S203 are repeated.
If it is determined in step S205 that the group of the sorted records have been read, then the termination process is performed on the function in step S210, and the result of the function process performed on the previous record and the record are output as a resultant record in step S211, and the process terminates.
FIGS. 165A through 165F are practical examples showing the proceedings of the group-by process performed according to the flowchart shown in FIG. 164. As shown in FIG. 165A, a group of records to be processed in a group-by process contains 10 records, and each record comprises only a key value, for example, an item number for simple representation. When the process in step S201 is completed, the state shown in FIG. 165B is realized.
In this example, the operation in the group-by process is a count process for obtaining the number of records having the same key value. When the processes in steps S202 and S203 terminate, the state shown in FIG. 165C is entered. That is, in the initialization of the function shown in step S203, the count value is set to 0.
In step S204, a count value is incremented only by 1 to enter the state shown in FIG. 165D. The determination in step S205 is YES, and in step S206, it is determined that the current record `1` is equal to the previous record, and the new record `1` becomes a current record. The determination in step S207 is YES, the count value is incremented in step S204, and the state shown in FIG. 165E is entered.
The determination in step S205 is YES again, and a new record is read in step S206. The value of the key of the record is 2, and the value is different from the value of the previous record, which is 1. Therefore, the termination process is performed on a function in step S208. When a counting operation is performed, the termination process is only to fix the current count value, and the fixed value indicates the result of the process of the function and is added to the previous record, that is, `1` in this example, to be output as a resultant record in step S209. That is, the output resultant record is `1, 2`. FIG. 165F shows the result.
By repeatedly performing the processes, the following group of records can be finally obtained as a process result.
1, 2
2, 2
3, 3
4, 1
5, 2
The result indicates that the group of records to be processed in the group-by process contains two records having the key value of 1, two records having the key value of 2, three records having the key value of 3, one record having the key value of 4, and two records having the key value of 5.
FIG. 166 shows an example of a conventional group-by processing system based on the hash process. When FIG. 166 is compared with the flowchart based on the sort process shown in FIG. 164, the group of records to be processed in the group-by process are stored in one of record buffers according to the hash function values in step S221. Then, records in each record buffer are sorted according to key values in step S222, all sorted contents in the record buffers are connected to form a string, and the processes almost the same as the processes in and after step S202 shown in FIG. 164 are performed in steps S223 through S232.
FIGS. 167A through 167F show a practical example of the proceedings of the process performed according to the flowchart shown in FIG. 166. The group of records to be processed in the group-by process are the same as those shown in FIGS. 165A through 165F. When the process is based on the hash process, the group of records to be processed in the group-by process are hashed using an appropriate hash function. In this example, mod 3 is used as a hash function. That is, the value of the key is divided by 3, and the record is distributed to a hash bucket depending on the value of the remainder. If the remainder is i, a hash bucket i stores the record. In this example, the value of the remainder can be 0, 1, or 2, and there can be three hash buckets.
The state shown in FIG. 167B shows the state after the process in step S221 shown in FIG. 166 and the end of hash. FIG. 167C shows the result of sorting a group of records in each hash bucket according to key values. FIG. 167D shows the result of integrating the contents of all hash buckets into a string. Thus, the processes up to step S222 terminate. The subsequent processes are performed in the same manner as the sort process. Finally, the following records are obtained as a resultant group of records.
3, 3
1, 2
4, 1
2, 2
5, 2
Although the order of the records is different from the order shown in FIGS. 165A through 165F, the entire group outputs the same result as the case based on the sort process.
Since the operation of counting the combinations of data is part of the association rule generation process in the above described database mining, the association rule is first described. The group-by processing system according to the present invention is used as part of the process of counting the combinations of data as described later.
For example, let's suppose that according to 100 customers' receipts collected through POS (point-of-sales) at a retail shop, 20 customers bought product A, and 12 customers bought both products A and B. One product is called an item, and a receipt slip is called a transaction. One transaction normally contains many items. Based on the following definition, the support of the product A is 20%, and the support of the products A and B is 12%.
Support of items=number of transactions containing items/total number of transactions
Furthermore, based on a simple conditional probability computation, the conclusion that '60% (12%/20%) of the customers who buy the product A buy also the product B' can be obtained. This is represented by 'A->B, confidence of 60%, and support of 12%'. It is defined as an association rule. The confidence of the association rule `A->B` is computed by:
Confidence of `A->B`=support of A.LAMBDA.B (both A and B are bought)/support of A
Furthermore, in addition to the simple rule such as `A->B`, etc., a complicated rule such as `A.LAMBDA.B->C.LAMBDA.D.LAMBDA.E` (a customer who buys A and B also buys C, D, and E), etc. can be applied. The confidence in this example can be computed as follows.
Confidence of `A.LAMBDA.B->C.LAMBDA.D.LAMBDA.E`=support of A.LAMBDA.B.LAMBDA.C.LAMBDA.D.LAMBDA.E / support of A.LAMBDA.B
The association rule can be applied, for example, to the evaluation of the contribution of a special service product, the optimization of store shelves (arranging products for good combinations), the improvement of a hit ratio of direct mail from the data of credit cards, etc.
The association rule generating process includes the steps of: (1) counting the occurrences of combinations of items which satisfy the condition of a given support in the transactions; and (2) computing the rule, support, and confidence based on the combinations and their number of occurrences obtained in step (1). The present invention is designed to improve the counting process in step (1).
In what follows, a combination of items is referred to as `itemset`, and a combination of items which satisfy the condition of the given support and is found in step (1) is referred to as a `large itemset`. The condition of the support is in the range from the minimum value (0% [=counting any itemset purchased in a transaction] through 100% [=counting itemsets purchased in all transactions]) to the maximum value (minimum value<=maximum value <=100%). Conventionally, the maximum value is fixed to 100% in many cases.
Since the process of counting a large itemset takes a long time, various processing methods have been proposed. Especially, the SETM algorithm based on the SQL, and the Apriori algorithm which is one of the algorithms suggested by IBM, are known as typical algorithms. The SETM algorithm and the Apriori algorithm are described in the following documents 1 and 2.
document 1: Maurice Houtsma and Arun Swami. Set-Oriented Mining for Association Rules in Relational Databases. In Proceedings of the IEEE Data Engineering Conference, pages 25-33, 1995.
document 2: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.
The association rule generation according to the SETM algorithm is based on the SQL language which is a relational database query language, and can be easily implemented. In this process, an SQL join operation and a group-by operation are performed. A self-join operation is performed using a table of transactions containing the large itemsets of length k-1 (the large (k-1)-itemsets) in order to generate potential candidates of large k-itemsets (the candidate k-itemsets). Then, these candidate k-itemsets are counted in a group-by operation, and those satisfying the minimum support condition form the large k-itemsets. Furthermore, the transactions which contain large k-itemsets are generated in a join operation, and are used in generating the candidate (k+1)-itemsets.
FIG. 168 shows an illustrative example of SETM. FIG. 169 shows the contents of each function block in the SETM algorithm Using these figures, the SETM algorithm is described in details in what follows.
In FIG. 168, a table R1' shows the items contained in each transaction t_x. For example, transaction t_x 1 contains items 1, 2, and 3. A GB(1) performs the counting (a group-by operation) of the occurrences of each item. A table L1 contains the items and their number of occurrences in the transactions for the case that the minimum support value is 20% and maximum support value is 100%. Thus, all items which appear in two or more transactions are contained in L1.
A table R1 shows the result obtained in the join process J(1) by extracting from R1' the transactions whose items appear in table L1.
An SJ(1) does the self-join of table R1. As a result of the operation, possible 2-itemsets can be generated for each transaction and inserted in table R2'.
In the group-by process GB(2) on R2', the occurrences of 2-itemsets are counted. Those whose counts are equal to or larger than 2 compose the large 2-itemsets and are inserted in table L2.
Similarly, a table L3 for large 3-itemsets and a table L4 for large 4-itemsets are generated. However, the table L4 is empty.
In the Apriori algorithm, a candidate k-itemset is generated using the set of large (k-1)-itemsets. When the whole set of large (k-1)-itemsets are completely stored in memory, candidate k-itemsets are generated by first joining large (k-1)-itemsets. For each k-itemset resultant from this join, all (k-1) subsets are generated and checked against the large (k-1)-itemsets. Only when all the subsets are large (k-1)-itemsets, it is assumed that the join result of length k is a candidate k-itemset. Thus, in Apriori, large k-itemsets are stored in a hash-table in memory and can be used to efficiently prune the unnecessary candidates. Furthermore, the candidate k-itemsets are stored in a hash-tree in memory, and by scanning the transactions, the k-itemsets within the transactions and that are found in the hash-tree have their counter incremented. Unnecessary k-itemsets contained in the transactions are not counted by only counting those found in the hash-tree.
FIG. 170 shows an illustrative example of the Apriori algorithm. FIG. 171 shows the contents of each function block in the processing of the Apriori algorithm.
In FIG. 170, the contents of a list TL containing 8 transactions are actually the same as those shown in FIG. 168. First, each of the items contained in these transactions is input in Subset (1), and the occurrences of each item are counted and inserted in C1. The count result is input into F, and the items having the number of occurrences equal to or larger than 2 are selected through a filtering operation, and generate the set of large 1-itemsets L1, that are organized in a hash-table.
A self-join of all items in L1 is executed by AG(1), and the result is the set of candidate 2-itemsets C2, that is organized in a hash-tree. If a transaction list TL contains a 2-itemset that is stored in the hash-tree as a candidate itemset, the counter of that candidate itemset is incremented by Subset (2). By filtering the result through F, the candidate 2-itemsets having the occurrence number equal to or larger than 2 are in fact large itemsets L2 that are stored in a hash-table.
By performing similar processes, 3-itemsets having the occurrence numbers equal to or larger than 2 are obtained as large 3-itemsets L3. The process terminates when, as shown in FIG. 170, it is determined that no 4-itemsets having the occurrence number equal to or larger than 2 exist.
First, the problems with the conventional technology relating to the group-by processing system are described. As described above, the group-by processing system can be based on either the sorting process or the hash process. In the sorting process, the entire group of records should be simultaneously sorted, thereby requiring a long time and a high cost.
On the other hand, in the hash process, a lower cost is required than in the sorting process because the sorting operation is performed in the hash process only on a group of records contained in each record buffer. However, if the entire group of records to be processed in the group-by processing system is too large to be stored in the main memory, then the records in the record buffer should be written to the secondary storage device and the records should be read from the secondary storage device into the record buffer, and the access to the secondary storage device is costly.
That is, in the conventional hash processing system, record buffers are provided in the main memory, and a record is stored in the record buffer corresponding to the hash function value for the record. When a record buffer is full, the records in the record buffer are output to one of the hashed lists. Each hashed list corresponds to a record buffer, and is provided in the secondary storage device.
In the processes described above, each of the output processes is performed individually, and the records stored in one of the record buffers are sequentially output to the corresponding one of the hashed list in the secondary storage device. Therefore, in each of the hashed lists, each record is stored in a region next to a region in which the previous record is stored.
However, from the viewpoint of a memory region of the secondary storage device in which all of the hashed lists are included, the record output processes are non-sequentially performed, and the records output from each of the record buffers are discontinuously stored in the memory region.
Therefore, according to the conventional hash processing system, many empty regions are formed in the memory region of the secondary storage device, and this reduces the available storage capacity of the secondary storage device and the operating performance of the computer system.
Described below are some problems with the SETM algorithm and the Apriori algorithm for finding large itemsets in the generation of association rules. Concerning the SETM algorithm, since there are no means for pruning unnecessary combinations in the self-join operation SJ(k-1), Rk' becomes too large and thus, the subsequent group-by process GB(k) becomes too heavy.
The Apriori algorithm solves this problem of SETM algorithm. Apriori generates the candidate k-itemsets Ck to be counted in a pass k by joining the large (k-1)-itemsets L(k-1), and pruning those k-itemsets that contain any subset that is not a large (k-1)-itemset. This procedure results in the generation of a much smaller number of candidate itemsets Ck than in SETM. However, this pruning of k-itemsets requires that all the large (k-1)-itemsets are maintained in memory. If they can not fit in memory, the pruning can no longer be done. But, for AG(1) in Apriori, the join operation is almost a Cartesian Product of items in L1 and thus, the generation of C2 is a very heavy process. As illustrated in FIGS. 168 and 170, the number of 2-itemsets in C2 is even greater than the number of 2-itemsets in R2' for SETM, since for the generation of C2 in Apriori there is no restriction that the 2 items should belong to the same transaction.
Furthermore, in the Apriori algorithm, the k-itemsets in the transaction are counted based on the candidate k-itemsets that are stored in a hash-tree. Therefore, if all k-itemsets in the hash-tree can not be stored in the memory, they are partitioned so that each partition can fit in the available memory space. For each such partition, the transactions have to be scanned once. This is a problem when the set of transactions is too huge and its reading requires a large amount of time.
SUMMARY OF THE INVENTION
The first subject of the present invention is to perform a hash process at high-speed by sequentially reading and writing data in relatively large blocks to gain the efficient access to a secondary storage device like the completely sequential access, and to perform group-by operation at high-speed by using a result of the hash process.
The second subject of the present invention is to efficiently prune the number of itemsets to count which, in the Apriori algorithm, can not be done when the whole set of candidate itemsets doesn't fit in the available memory. Instead of holding all the candidate itemsets themselves, our invention uses a bit map capable of constantly fitting an available memory capacity. Thus, it can adjust the bit map size and thus the pruning capability so that unnecessary itemsets can be pruned, and the group-by process for counting itemsets is performed at a higher speed than in the conventional technology, thereby efficiently counting the combinations of related data.
A feature of the present invention belong to a group-by processing system for performing a group-by operation based on a hash method. The group-by processing system includes a record storing unit for temporarily storing records; a pointer storing unit for storing pointers to the records in the record storing unit at positions each of which corresponds to a hash function value calculated using the key value of the pointed record; a output unit for outputting the records pointed to by the pointers stored in the pointer storing unit to the storage device, given the hash function values for storage positions of the pointers; and a group-by operation execution unit for reading a list of hashed records output to the storage device by the output unit, sorting the hashed records in the list according to the key value, and performing the group-by operation on the list of the sorted records.
Another feature of the present invention resides in a related data combination count system for obtaining individual items or a combination of two or more items in a number of transactions containing one or more items as data, and a number of occurrences of individual items or the combination of two or more items when the individual items or the combination of two or more items satisfy a given condition of the number of occurrences in the transactions. The system includes a combination generation unit for outputting an item in each transaction when the individual items is to be obtained, and generating and outputting a combination of two or more items in each transaction satisfying a combination generation restriction condition on subsets of the combinations of items when the combination of two or more items is to be obtained; an occurrence count unit for counting occurrences, in all transactions, of the individual items or the combinations of two or more items output by the combination generation unit; a combination selection unit for selecting the individual items or the combinations of two or more items which is output by the occurrence count unit and satisfy the given condition of the number of occurrences; and a restriction condition generation unit for assigning the combination generation restriction condition to the combination generation unit according to the selection result of the combination selection unit.
Thus, group-by operation can be realized at high-speed according to the present invention by sequentially reading and writing data in relatively large blocks in order to gain the efficient access to a secondary storage device like the completely sequential access. As a result, the group-by operation is performed at high-speed based on the result of the hash process. Next, with the introduction of the data combination count system according to the present invention, the pruning of unnecessary combinations of items can be efficiently performed even if the available memory capacity is small by using a bit map which fits an available memory capacity. Furthermore, the process of counting combinations of items can be performed efficiently by performing a high-speed group-by process in which itemsets are counted.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be more apparent from the following detailed description, when taken in conjunction with the accompanying drawings, in which:
FIG. 1A is a block diagram showing the configuration according to the principle of the group-by processing system of the present invention;
FIG. 1B is a block diagram showing the configuration according to the principle of the related data combination counting system of the present invention;
FIG. 2 shows a computer embodying the group-by processing system according to the present invention.
FIG. 3 shows the hash process according to the first embodiment of the group-by processing system;
FIG. 4 is a flowchart showing the entire hash process according to the first embodiment of the group-by processing system;
FIG. 5 is a flowchart showing the output process shown in FIG. 4;
FIG. 6 is a flowchart showing the termination process shown in FIG. 4;
FIG. 7 is a flowchart showing the output buffer output process shown in FIGS. 5 and 6;
FIGS. 8A through 8O show the proceedings of the hash process according to the first embodiment of the group-by processing system;
FIG. 9 shows the hashed list obtained as a result of the hash process according to the first embodiment of the group-by processing system;
FIG. 10 shows the hash process according to the second embodiment of the group-by processing system;
FIG. 11 shows a flowchart showing the output buffer output process according to the second embodiment of the group-by processing system;
FIG. 12 is a flowchart showing the auxiliary information list sort process (first method);
FIG. 13 is a flowchart showing the auxiliary information list sort process (second method);
FIG. 14 is a flowchart showing the auxiliary information list sort process (third method);
FIG. 15 is a flowchart showing the auxiliary information list sort process (fourth method);
FIG. 16 is a flowchart showing the auxiliary information list sort process (fifth method);
FIG. 17 is a flowchart showing the auxiliary information list sort process (sixth method);
FIG. 18 shows the entire group-by function process according to the first and second embodiments of the group-by processing system;
FIG. 19 is a flowchart showing the entire group-by function process according to the first and second embodiments of the group-by processing system;
FIG. 20 is a flowchart showing the group-by function operation process;
FIG. 21 shows the hash process according to the third embodiment of the group-by processing system;
FIG. 22 shows a flowchart showing the output buffer output process according to the third embodiment of the group-by processing system;
FIG. 23 shows the entire group-by function process according to the third embodiment of the group-by processing system;
FIG. 24 is a flowchart showing the entire group-by function process according to the third embodiment of the group-by processing system;
FIG. 25 shows the hash process according to the fourth embodiment of the group-by processing system;
FIG. 26 is a flowchart showing the output process in the hash process according to the fourth embodiment of the group-by processing system;
FIG. 27 is a flowchart showing the termination process in the hash process according to the fourth embodiment of the group-by processing system;
FIG. 28 is a flowchart of the output buffer output process shown in FIGS. 26 and 27;
FIG. 29 is a flowchart showing the output process when the hashed record output region is non-sequential in the fourth embodiment of the group-by processing system;
FIG. 30 shows an example of an output result when the hashed record output region is non-sequential in the fourth embodiment of the group-by processing system;
FIG. 31 shows the entire group-by function process according to the fourth embodiment of the group-by processing system;
FIG. 32 is a flowchart showing the entire group-by function process according to the fourth embodiment of the group-by processing system;
FIG. 33 shows the hash process according to the fifth embodiment of the group-by processing system;
FIG. 34 is a flowchart showing the output buffer output process in the hash process according to the fifth embodiment of the group-by processing system;
FIG. 35 shows the entire group-by function process according to the fifth embodiment of the group-by processing system;
FIG. 36 is a flowchart showing the entire group-by function process according to the fifth embodiment of the group-by processing system;
FIG. 37 shows the entire embodiment of the group-by processing system according to the present invention using practical data;
FIG. 38 shows the configuration of the hash process device shown in FIG. 37;
FIG. 39 is a flowchart showing the entire process of the group-by processing system according to the present invention using a practical example;
FIG. 40 is a flowchart showing the hash process using practical data;
FIG. 41 is a flowchart showing the entire group-by process using practical data;
FIG. 42 is a flowchart showing the count process shown in FIG. 41;
FIG. 43 shows the proceedings of the hash process using practical data;
FIG. 44 shows the proceedings of the process of retrieving the minimum hash function value record corresponding to the first embodiment of the group-by processing system using practical data;
FIG. 45 shows the proceedings of the process of retrieving the minimum hash function value record corresponding to the fourth embodiment of the group-by processing system using practical data;
FIG. 46 shows the configuration according to the embodiment of the data combination count system of the present invention;
FIG. 47 is a flowchart showing the entire process of the data combination count system of the present invention;
FIG. 48 is a flowchart showing the large itemset generation process;
FIG. 49 is a chart (1) showing the generation process C(1) and the count process G(1) of individual items;
FIG. 50 is a chart (2) showing the generation process C(1) and the count process G(1) of individual items;
FIG. 51 is a chart (3) showing the generation process C(1) and the count process G(1) of individual items;
FIG. 52 is a chart (4) showing the generation process C(1) and the count process G(1) of individual items;
FIG. 53 is a chart (1) showing the selection of the large 1-itemset L(1);
FIG. 54 is a chart (2) showing the selection of the large 1-itemset L(1);
FIG. 55 is a chart (3) showing the selection of the large 1-itemset L(1);
FIG. 56 is a chart (4) showing the selection of the large 1-itemset L(1);
FIG. 57 is a chart (5) showing the selection of the large 1-itemset L(1);
FIG. 58 is a chart (6) showing the selection of the large 1-itemset L(1);
FIG. 59 is a chart (1) showing the processes up to the process G(2) of counting combinations of two items;
FIG. 60 is a chart (2) showing the processes up to the process G(2) of counting combinations of two items;
FIG. 61 is a chart (3) showing the processes up to the process G(2) of counting combinations of two items;
FIG. 62 is a chart (4) showing the processes up to the process G(2) of counting combinations of two items;
FIG. 63 is a chart (1) showing the selection of the large 2-itemset L(2);
FIG. 64 is a chart (2) showing the selection of the large 2-itemset L(2);
FIG. 65 is a chart (3) showing the selection of the large 2-itemset L(2);
FIG. 66 is a chart (4) showing the selection of the large 2-itemset L(2);
FIG. 67 is a chart (5) showing the selection of the large 2-itemset L(2);
FIG. 68 is a chart (6) showing the selection of the large 2-itemset L(2);
FIG. 69 is a chart (7) showing the selection of the large 2-itemset L(2);
FIG. 70 is a chart (8) showing the selection of the large 2-itemset L(2);
FIG. 71 is a chart (9) showing the selection of the large 2-itemset L(2);
FIG. 72 is a chart (10) showing the selection of the large 2-itemset L(2);
FIG. 73 is a chart (1) showing the processes up to the process G(3) of counting combinations of three items;
FIG. 74 is a chart (2) showing the processes up to the process G(3) of counting combinations of three items;
FIG. 75 is a chart (3) showing the processes up to the process G(3) of counting combinations of three items having the length of 3;
FIG. 76 is a chart (4) showing the processes up to the process G(3) of counting combinations of three items having the length of 3;
FIG. 77 is a chart (1) showing the selection of the large 3-itemset L(3);
FIG. 78 is a chart (2) showing the selection of the large 3-itemset L(3);
FIG. 79 is a chart (3) showing the selection of the large 3-itemset L(3);
FIG. 80 shows a practical example of the hierarchical structure of data used in the hierarchical association analysis;
FIG. 81 is a chart (1) showing the processes up to the process G(1) in the hierarchical association analysis;
FIG. 82 is a chart (2) showing the processes up to the process G(1) in the hierarchical association analysis;
FIG. 83 is a chart (3) showing the processes up to the process G(1) in the hierarchical association analysis;
FIG. 84 is a chart (4) showing the processes up to the process G(1) in the hierarchical association analysis;
FIG. 85 is a chart (1) showing the selection of L(1) in the hierarchical association analysis;
FIG. 86 is a chart (2) showing the selection of L(1) in the hierarchical association analysis;
FIG. 87 is a chart (3) showing the selection of L(1) in the hierarchical association analysis;
FIG. 88 is a chart (4) showing the selection of L(1) in the hierarchical association analysis;
FIG. 89 is a chart (5) showing the selection of L(1) in the hierarchical association analysis;
FIG. 90 is a chart (6) showing the selection of L(l) in the hierarchical association analysis;
FIG. 91 is a chart (7) showing the selection of L(1) in the hierarchical association analysis;
FIG. 92 is a chart (8) showing the selection of L(1) in the hierarchical association analysis;
FIG. 93 is a chart (9) showing the selection of L(1) in the hierarchical association analysis;
FIG. 94 is a chart (10) showing the selection of L(1) in the hierarchical association analysis;
FIG. 95 is a chart (1) showing the processes up to the process G(2) in the hierarchical association analysis;
FIG. 96 is a chart (2) showing the processes up to the process G(2) in the hierarchical association analysis;
FIG. 97 is a chart (3) showing the processes up to the process G(2) in the hierarchical association analysis;
FIG. 98 is a chart (4) showing the processes up to the process G(2) in the hierarchical association analysis;
FIG. 99 is a chart (1) showing the selection of L(2) in the hierarchical association analysis;
FIG. 100 is a chart (2) showing the selection of L(2) in the hierarchical association analysis;
FIG. 101 is a chart (3) showing the selection of L(2) in the hierarchical association analysis;
FIG. 102 is a chart (4) showing the selection of L(2) in the hierarchical association analysis;
FIG. 103 is a chart (5) showing the selection of L(2) in the hierarchical association analysis;
FIG. 104 is a chart (6) showing the selection of L(2) in the hierarchical association analysis;
FIG. 105 is a chart (7) showing the selection of L(2) in the hierarchical association analysis;
FIG. 105 is a chart (8) showing the selection of L(2) in the hierarchical association analysis;
FIG. 107 is a chart (9) showing the selection of L(2) in the hierarchical association analysis;
FIG. 108 is a chart (10) showing the selection of L(2) in the hierarchical association analysis;
FIG. 109 is a chart (11) showing the selection of L(2) in the hierarchical association analysis;
FIG. 110 is a chart (1) showing the processes up to the process G(3) in the hierarchical association analysis;
FIG. 111 is a chart (2) showing the processes up to the process G(3) in the hierarchical association analysis;
FIG. 112 is a chart (3) showing the processes up to the process G(3) in the hierarchical association analysis;
FIG. 113 is a chart (4) showing the processes up to the process G(3) in the hierarchical association analysis;
FIG. 114 is a chart (1) showing the selection of L(3) in the hierarchical association analysis.
FIG. 115 is a chart (2) showing the selection of L(3) in the hierarchical association analysis.
FIG. 116 is a chart (3) showing the selection of L(3) in the hierarchical association analysis.
FIG. 117 is a chart (4) showing the selection of L(3) in the hierarchical association analysis.
FIG. 118 is a chart (5) showing the selection of L(3) in the hierarchical association analysis.
FIG. 119 is a chart (6) showing the selection of L(3) in the hierarchical association analysis.
FIG. 120 is a chart (1) showing the processes up to the process G(4) in the hierarchical association analysis;
FIG. 121 is a chart (2) showing the processes up to the process G(4) in the hierarchical association analysis;
FIG. 122 is a chart (3) showing the processes up to the process G(4) in the hierarchical association analysis;
FIG. 123 is a chart (4) showing the processes up to the process G(4) in the hierarchical association analysis;
FIG. 124 shows an example of a sequence list in a time-series analysis;
FIG. 125 is a chart (1) showing the processes up to the process G(1) in the time-series analysis;
FIG. 126 is a chart (2) showing the processes up to the process G(1) in the time-series analysis;
FIG. 127 is a chart (3) showing the processes up to the process G(1) in the time-series analysis;
FIG. 128 is a chart (4) showing the processes up to the process G(1) in the time-series analysis;
FIG. 129 is a chart (5) showing the processes up to the process G(1) in the time-series analysis;
FIG. 130 is a chart (1) showing the selection of L(1) in the time-series analysis;
FIG. 131 is a chart (2) showing the selection of L(1) in the time-series analysis;
FIG. 132 is a chart (3) showing the selection of L(1) in the time-series analysis;
FIG. 133 is a chart (4) showing the selection of L(1) in the time-series analysis;
FIG. 134 is a chart (5) showing the selection of L(1) in the time-series analysis;
FIG. 135 is a chart (6) showing the selection of L(1) in the time-series analysis;
FIG. 136 is a chart (7) showing the selection of L(1) in the time-series analysis;
FIG. 137 is a chart (8) showing the selection of L(1) in the time-series analysis;
FIG. 138 is a chart (1) showing the processes up to the process G(2) in the time-series analysis;
FIG. 139 is a chart (2) showing the processes up to the process G(2) in the time-series analysis;
FIG. 140 is a chart (3) showing the processes up to the process G(2) in the time-series analysis;
FIG. 141 is a chart (4) showing the processes up to the process G(2) in the time-series analysis;
FIG. 142 is a chart (5) showing the processes up to the process G(2) in the time-series analysis;
FIG. 143 is a chart (1) showing the selection of L(2) in the time-series analysis;
FIG. 144 is a chart (2) showing the selection of L(2) in the time-series analysis;
FIG. 145 is a chart (3) showing the selection of L(2) in the time-series analysis;
FIG. 146 is a chart (4) showing the selection of L(2) in the time-series analysis;
FIG. 147 is a chart (5) showing the selection of L(2) in the time-series analysis;
FIG. 148 is a chart (6) showing the selection of L(2) in the time-series analysis;
FIG. 149 is a chart (7) showing the selection of L(2) in the time-series analysis;
FIG. 150 is a chart (8) showing the selection of L(2) in the time-series analysis;
FIG. 151 is a chart (9) showing the selection of L(2) in the time-series analysis;
FIG. 152 is a chart (10) showing the selection of L(2) in the time-series analysis;
FIG. 153 is a chart (11) showing the selection of L(2) in the time-series analysis;
FIG. 154 is a chart (1) showing the processes up to the process G(3) in the time-series analysis;
FIG. 155 is a chart (2) showing the processes up to the process G(3) in the time-series analysis;
FIG. 156 is a chart (3) showing the processes up to the process G(3) in the time-series analysis;
FIG. 157 is a chart (4) showing the processes up to the process G(3) in the time-series analysis;
FIG. 158 is a chart (5) showing the processes up to the process G(3) in the time-series analysis;
FIG. 159 is a chart showing the selection of L(3) in the time-series analysis;
FIG. 160 shows the flow of the process according to the present invention in the basic association analysis;
FIG. 161 shows the proceedings of the hashed list generation process in the item combination count process;
FIG. 162 shows the hashed list as a result of the process shown in FIG. 161;
FIG. 163A shows the proceedings of the minimum hash function value record retrieval process for the hashed list shown in FIG. 162;
FIG. 163B shows the configuration of the computer embodying the group-by processing system according to the present invention.
FIG. 164 is a flowchart showing an example of the conventional technology of the group-by process based on the sort process;
FIGS. 165A through 165F show the proceedings of the practical process using the flowchart shown in FIG. 164;
FIG. 166 is a flowchart showing an example of the conventional technology of the group-by process based on the hash process;
FIGS. 167A through 167F show the proceedings of the practical process using the flowchart shown in FIG. 166;
FIG. 168 shows the flow of the practical process in the SETM algorithm;
FIG. 169 shows the contents of each function block process in the SETM algorithm;
FIG. 170 shows the flow of the practical process in the Apriori algorithm; and
FIG. 171 shows the contents of each function block in the Apriori algorithm.
DETAILED DESCRIPTION OF THE EMBODIMENTS
FIG. 1A is a block diagram showing the configuration according to the principle of the group-by processing system of the present invention. The group-by processing system shown in FIG. 1A performs a group-by operation for transforming a group of records to be hashed stored in a storage unit into a referenceable storage form using hash function values, each of which corresponds to the key value of the record.
In FIG. 1A, a record storing unit 106 is, for example, a record buffer for temporarily storing a group of records to be processed in a group-by process. A pointer storing unit 107 is, for example, a hash table for storing pointers to the records stored in the record storing unit 106 at positions, each of which corresponds to a hash function value calculated from the key value of the record.
A output unit 108 outputs the records pointed to by the pointers stored in the pointer storing unit 107 to storage, given hash function values for the storage positions of the pointers.
A group-by operation execution unit 109 reads a list of hashed records output to the storage device by the output unit 108, sorts the list of the records by its key value, and performs a group-by operation on the list of the sorted records.
The group-by processing system according to the present invention can include, for example, an auxiliary information list storage unit in addition to the components shown in FIG. 1A. When the output unit 108 outputs the records to the storage device as a set of blocks, the auxiliary information list stores a list of auxiliary information for use in retrieving the records in the block according to the hash function value.
In this case, the group-by operation execution unit 109 performs a group-by operation using a hashed list, which includes the records each corresponding to the hash function value, obtained by hash process, and an auxiliary information list for use in retrieving the records in the block according to the hash function value.
In the group-by processing system according to the present invention, a run information storage unit can replace the above described auxiliary information list storage unit. When the output unit 108 outputs to the storage device a record output from the record storing unit 106 according to the hash function value, the run information storage unit stores run information. The run information includes an identifier for indicating a group of records output while the hash function value is continuously changed from the minimum value to the maximum value.
In this case, the group-by operation execution unit 109 performs a group-by operation using the records stored in the hashed list, and the run information stored in the run information storage unit.
According to the present invention, using either the auxiliary information list storage unit or the run information storage unit, the records are stored in the hashed list when the output unit 108 repeats an operation of continuously outputting the records as a block to continuous regions in the storage device with the hash function value continuously changed, for example, from the minimum value to the maximum value. The output operation is repeated until there are no records to be processed in the group-by process.
In the group-by processing system according to the present invention, the process of accessing a storage device can be performed at high speed by continuously outputting the records output corresponding to the hash function values in the block to a secondary storage device such as a hard disk, etc.
FIG. 1B is the block diagram showing the combination count system according to the present invention. In FIG. 1B, the combination count system obtains individual items or combinations of two or more items in a number of transactions containing one or more items, and a number of occurrences of individual items or the combinations of two or more items when the individual items or the combinations of two or more items satisfy a given condition of the number of occurrences in the transactions.
In FIG. 1B, a combination generation unit 100 outputs an item in each transaction when the individual item is to be obtained, and generates and outputs a combination of items satisfying a combination generation restriction condition when the combination of two or more items is obtained. The generation restriction condition is concerned to the number of occurrences of each item and/or subsets of items in the combination of two or more items, which should meet the given condition of the number of occurrences in the transactions.
An occurrence count unit 101 counts the occurrences, in all transactions, of the individual item or the combination of two or more items output by the combination generation unit 100 in, for example, a group-by process.
A combination selection unit 102 selects the individual item or the combination of two or more items which is output by the occurrence count unit 101 and satisfies the given condition of the number of occurrences, for example, more than or equal to a specified number of times.
A restriction condition generation unit 103 assigns the combination generation restriction condition to the combination generation unit 100 according to the selection result of the combination selection unit 102. The combination generation restriction condition can be represented by a bit map having `1` at a bit position corresponding to the item or combination of items appearing in the selection result of the combination selection unit 102.
In the count system according to the present invention, the combination generation and counting are processed in multiple steps. In the first step, individual items are generated and counted. In each subsequent step, one item longer combinations are generated and counted.
In the first processing step, the combination generation unit 100 retrieves each item from each transaction. The occurrences of each item are counted by the occurrence count unit 101. At this time, the combination generation restriction condition can not be applied since no combination generation restriction condition is given yet. All items in the transactions are output one by one from the combination generation unit 100, and the occurrences are counted.
The output of the occurrence count unit 101, that is, the count result of the number of occurrences of each item in the transactions is provided to the combination selection unit 102. The combination selection unit 102 selects, for example, the items whose number of occurrences is equal to or larger than a predetermined value.
The output result is provided to the restriction condition generation unit 103. The bit corresponding to an item contained in the output of the combination selection unit 102 is set to `1` in a bit map. The bit corresponding to an item that is not contained in the output remains set to `0`. Thus, the bit map is generated and is provided to the combination generation unit 100 as a generation restriction condition for the generation of combinations of two items in the next processing step.
In the second processing step, for each transaction, the combination generation unit 100 generates combinations of two items that satisfy the combination generation restriction condition. In this case, the restriction condition is that each item has the corresponding bit set to `1` in the bit map. These combinations of two items are output to the occurrence count unit 101. The occurrence count unit 101 counts the occurrences of the combinations of two items, and provides the result to the combination selection unit 102. The combination selection unit 102 selects and outputs combinations of two items meeting a given condition of the number of occurrences. The output result is provided to the restriction condition generation unit 103. The bits corresponding to each item and/or combination of two items contained in the output of the combination selection unit 102 are set to `1` in the bitmap which represents the combination generation restriction condition for the generation of the combination of three items in the next processing step.
For the processing of combinations of three items, first the combination generation unit 100 generates combinations of three items whose bits corresponding to each items and/or corresponding to each subset of two items are set to `1` in the bitmap which represents the restriction condition generated by unit 103. The generated combinations of three items are output and counted in the occurrence count unit 101. The output of the occurrence count unit 101 is provided to the combination selection unit 102, which only outputs those combinations of three items whose number of occurrences satisfies the specified number of occurrences.
By repeating similar processing steps, combinations of one item longer are generated, as in the examples shown in FIGS. 168 and 170 for the conventional SETM and Apriori algorithms. The process of the combination count system terminates when no combination of items is output from the combination selection unit 102.
The present invention shown in FIGS. 1A and 1B is summarized as follows. A group-by operation performs a specified operation, for example, average value computation, on records having the same key value of the record, efficiently accesses a secondary storage device, aims at performing the process at high-speed, and includes record storing unit 106 for temporarily storing the records; pointer storing unit 107 for storing pointers to the records stored in the record storing unit 106 at positions, each of which corresponds to a hash function value computed from the key value of the record; output unit 108 for outputting to the secondary storage device the records specified by the pointers in the pointer storing unit 107 corresponding to the hash function values; and group-by operation execution unit 109 for reading a list of output and hashed records, sorting the records by the key values of the records, and performing a group-by operation on the sorted records.
Concerning the combination generation and counting, the hash-table used in the conventional Apriori technology requires a memory capacity proportional to the number of individual items or combinations of items in a large itemset, and there is no way to prune unnecessary itemsets by only using a portion of the large itemsets that can be stored in a hash-table in the available memory. The only way to perform the efficient pruning of unnecessary candidate itemsets is to hold the whole set of large itemsets in a hash-table in memory. Therefore, in the conventional technology, when the volume of data to be processed is large, the large itemset is also large and the hash-table possibly exceeds the available memory capacity and thus, the pruning process cannot be performed. The pruning process according to the present invention is performed using one or more bit maps which represent an individual item or a combination of two or more items that exists in a large itemset. Since the bit map can represent items or a combination of two or more items using a single bit, it can improve the pruning effectiveness by enlarging the bit map size, or this size can be reduced in order to reduce the memory requirements. Thus the present invention can solve the above described problem of no pruning at all unless the whole set of large itemsets fits in an available memory.
FIG. 2 shows the group-by processing system 200 for embodying the group-by processing system according to the present invention. An example of the group-by processing system comprises a processing device 210, an external storage device 220, an input unit 230, and a network 240.
The processing device 210 comprises a CPU 211 and memory 212. The CPU 211 performs a group-by operation. The memory 212 temporarily stores records to be processed in the group-by operation, and other necessary data. Furthermore, the memory 212 is loaded with a program for executing the group-by operation, and other programs such as an OS, etc.
The external (secondary) storage device 220 can be typically a hard disk, a floppy disk, an optical disk, etc., and stores records processed and to be processed in the group-by operation, etc. Furthermore, the external storage device 220 stores a program for executing the group-by operation, and other programs such as an OS, etc.
The input unit 230 is not required in the present invention, but can be used as necessary when an instruction involved in the group-by operation and a record to be processed are input.
The network 240 connects the group-by processing system 200 to another computer or network to transmit and receive necessary data, signals, etc.
The group-by processing system according to the present invention performs a group-by operation based on the result of a hash process. Described below is an embodiment of the group-by processing system according to the present invention by explaining a hash processing system and a group-by processing system.
FIG. 3 shows the hash system according to the first embodiment of the group-by processing system according to the present invention. The hash system comprises an input buffer 13, a record buffer 14, a hash table 15, a link control table 16, an output hash value memory 17, a hashed list output buffer 18, and an auxiliary information output buffer 19, as shown in FIG. 3.
The records to be hashed in the hash list 11 are read into the input buffer 13, and then stored in the record buffer 14. Pointers each indicating the storing position of a record in the record buffer 14 are stored in the hash table 15 in association with a hash function value obtained by a calculation of the key value of the corresponding record. The link control table 16 is for controlling a link between records in the record buffer 14 having the same hash function value.
If the record buffer 14 becomes full, a record in the record buffer 14 is indicated by one of the pointers stored in the hash table 15 and is output to the hashed list output buffer 18. The pointer used in the output of the record is designated by an output hash function value which is determined from the output hash value memory 17. Every time a record is output to the hashed list output buffer 18 from the record buffer 14, another record stored in the input buffer 13 is read into the record buffer 14 from the input buffer 13.
After the record is output to the hashed list output buffer 18, another record in the record buffer 14 is indicated by a pointer designated by the next output hash value in the output hash value memory 17 and is stored in the hashed list output buffer 18.
When the hashed list output buffer 18 becomes full, all of the records stored in the output buffer 18 are output to the hashed list 21 as a block. Then, the start address of the block, which corresponds to a block number in the storing region of the secondary storage, and the minimum value and the maximum value of the hash function values corresponding to the records in the block, are stored in the auxiliary information output buffer 19 as an auxiliary information record. The contents of the auxiliary information output buffer 19 are output to the auxiliary information list 22.
Each time a record is output from the record buffer 14 to the hashed list output buffer 18, a record in the input buffer 13 is stored in the record buffer 14, and the operations described above are performed.
FIGS. 4 through 7 are flowcharts showing the hash process according to the first embodiment of the group-by processing system according to the present invention. In this example, when the record buffer 14 shown in FIG. 3 becomes full, a record is moved to the hashed list output buffer 18, and an empty region of the record buffer 14 is immediately filled with records from the input buffer 13. When the hashed list output buffer 18 becomes full, its contents are output to the hashed list 21 as data in one block, and auxiliary information is stored in the auxiliary information list output buffer 19. And the auxiliary information list output buffer 19 becomes full, its contents are output to the auxiliary information list 22.
When the process starts as shown in FIG. 4, a group of records are read from the hash list 11 to the input buffer 13 in step S1. In step S2, the leading record in the input buffer 13 is moved to an empty region in the record buffer 14. In step S3, a hash function value for a key value of the record is calculated. In step S4, the pointer of the record is stored in an entry of the hash table 15 which corresponds to the hash function value of the record. If a record having the same hash function value is already stored in the record buffer 14 and a pointer indicating the record is already stored in the hash table 15, the pointer indicating the newly entered record is stored in the entry of the hash table 15, and the pointer previously stored in the entry of the hash table 15 is moved into the entry of the link control table 16 corresponding to the entry in which the newly entered record is stored, thereby realizing the link to the newly entered record from the record already stored.
Then, it is determined in step S5 whether or not there is an empty region in the record buffer 14. If there is an empty region, it is determined in step S7 whether or not there is a record in the input buffer 13. If yes, the process goes back to step S2.
If it is determined in step S5 that there is no empty region in the record buffer, then an output process for moving the record in the record buffer 14 to the hashed list output buffer 18 is performed in step S6. After the output process has been completed, control is passed to the process in step S7. The details of the output process are explained by referring to FIG. 5. If it is determined in step S7 that there are no records in the input buffer 13, then it is determined in step S8 whether or not there is a record in the hash list 11. If yes, a group of records are read from the hash list 11 to the input buffer 13 in step S9, and the process goes back to step S2.
If it is determined in step S8 that there are no records in the hash list 11, then it is determined in step S10 whether or not there are any records remaining in the record buffer 14. If yes, the process goes back to step S10 after the output process in step S11 is performed. If it is determined in step S10 that there are no records in the record buffer 14, then the process terminates after the terminating process in step S12 is performed. The details of the terminating process are described by referring to FIG. 6.
FIG. 5 is a flowchart showing the output process in steps S6 and S11 shown in FIG. 4. As shown in FIG. 5, after starting the output process, in step S16, whether or not a record indicated by the pointer stored in the entry of the hash table 15 designated by the output hash value memory 17 exists in the record buffer 14 is determined. If there is no record indicated in the record buffer 14, in step S17, the output hash function value in the output hash value memory 17 is incremented. Then, whether or not the output hash function value exceeds the maximum value of the hash function value is determined in step S18, and the process goes back to step S16 if the output hash function value does not exceed the maximum value. For example, when the hash function value is set to be the remainder from the division of the key value of the record by twenty (20), the maximum value is nineteen (19), and whether the output hash function value exceeds nineteen is determined in step S18.
If the output hash function value exceeds the maximum value of the hash function value in step S18, an output buffer output process, in which all of the records stored in the hashed list output buffer 18 are output to the hashed list 21 as a block in step S19, in order to maintain the order of the records in the block according to the hash function value, is performed. Then, the output hash function value is reset to zero in step S20 and the process goes back to step S16. The output buffer output process of step S19 will be described in detail referring to FIG. 7.
If it is determined that the record indicated by the pointer which is in the entry designated by the output hash function value, exists in the record buffer 14 in step S16, the record is moved into the hash list output buffer 18 from the record buffer 14 in step S21. If a record linked to the record moved into the hashed list output buffer 18 exists in the record buffer 14, the pointer indicating the record linked to the output record is stored in the entry corresponding to the present hash function value using the link control table 16. After that, whether or not the hashed list output buffer 18 is full is determined in step S22. If the hashed list output buffer 18 is not full in step S22, the process goes to step S7 or S10 shown in FIG. 4. If the hashed list output buffer 18 is full, after the output buffer process of step S23, the process also goes to step S7 or S10 shown in FIG. 4.
FIG. 6 is a flowchart showing the process in step S12 shown in FIG. 4, that is, the terminating process. When the process starts as shown in FIG. 6, it is first determined in step S26 whether or not there is a record in the hashed list output buffer 18. If there is a record, an output buffer output process for outputting the record to the hashed list 21 in step S27 is performed, and then the process in step S28 is performed. If there are no records remaining in the hashed list output buffer 18, control is immediately passed to step S28. In step S28, it is determined whether or not there is a record remaining in the auxiliary information list output buffer 19. If yes, the contents are output to the auxiliary information list 22 in step S29, and control is returned to the process shown in FIG. 4. If not, control is immediately returned to the processes shown in FIG. 4.
FIG. 7 shows the output buffer output process of step S19, step S23 and step S27. After starting the output buffer output process, the auxiliary information record is made and stored in the auxiliary information output buffer 19 in step S31. The auxiliary information record includes the start address of a block in the hashed list 21 in which the records output from the hashed list output buffer 18 are stored, and the minimum value and maximum value of the hash function values of the records in the block. The minimum value and the maximum value correspond to the hash function values of the first record and the last record in the block, respectively. Next, whether or not the auxiliary information output buffer 19 is full is determined in step S32. If the auxiliary information output buffer 19 is full in step S32, all of the auxiliary information records in the auxiliary information output buffer 19 are output to the auxiliary information list 22 (step S33), and the process goes to step S34. If the auxiliary information output buffer 19 is not full in step S32, the process directly goes to step S34, in which the contents of the hashed list output buffer 18 are output to the hashed list 21. After the step S34, the process returns to one of the steps of step S20 shown in FIG. 5, steps S7 or S10 shown in FIG. 4, or step S28 shown in FIG. 6.
FIGS. 8A through 8O show the steps of the hash process according to the flowcharts shown in FIGS. 4 through 7 performed on a group of records to be processed in a group-by operation described by referring to FIGS. 165A through 165F. In FIG. 8, the hash table shows the hash function values 0, 1, and 2 sequentially from the top. The mark * on the left indicates the contents (which hash function value) of the output hash value memory 17 shown in FIG. 3.
FIG. 8B shows the first record which is in the group of records to be processed in a group-by operation shown in FIG. 8A and is stored in the record buffer 14. That is, the hash function value for the first record of 5 is 2, and a pointer pointing to the record `5` stored in the record buffer 14 from the entry of 2 in the hash table is connected.
FIG. 8C shows the state in which the next record `3` is connected to the entry of the value 0 of the hash function. FIG. 8D shows the state in which the next record `3` is further connected to the entry of the hash function value 0 in the hash table. Thus, if there are a plurality of records pointed to from one entry of a hash table, the link among the plurality of records can be managed by the link control table 16 shown in FIG. 3. The details are described in a previous patent application by the Applicant as follows.
Japanese Laid open Patent Publication Hei 8129551 (published on May 21, 1996)
Title of the Invention: Hash System
FIG. 8E shows the state in which the next record `4` is pointed to from the entry having the hash function value of 1 in the hash table. As described above, if the record buffer 14 has the capacity for four records, then the record buffer is full at this point.
As shown in FIG. 8F, one record, that is, 3 in this example, is output from the record buffer 14 to the hashed list output buffer 18. The record output at this point has a hash function value, that is, 0 in this example, as the contents of the output hash value memory 17.
FIG. 8G shows the state in which the next record `2` is connected to the entry of 2 in the hash table. Since the record buffer 14 is full and the value of the output hash value memory 17 is 0 at this point, the record `3` connected to the entry of 0 in this example is output to the hashed list output buffer 18.
FIG. 8H shows the state in which the next record `1` is connected to the entry of 1 in the hash table. At this time, a record to be output to the hashed list output buffer 18 is a record pointed to from the entry of 1 in the hash table, that is, the record `1` which has just been stored in the record buffer 14 from the input buffer 13. This is caused by incrementing the contents of the output hash value memory 17 because no records are connected to the entry of the value of the output hash value memory 17, that is 0, after the record `3` connected to the entry is output at the point shown FIG. 8G.
The subsequent processes are similarly performed. Each time the hashed list output buffer 18 shown in FIG. 3 becomes full, the contents are output to the hashed list 21 in the secondary storage device. FIG. 9 shows the contents of the hashed list finally generated in the secondary storage device. Two records are stored for block number 2 because the output buffer output process, that is, the process in step S19 is performed when it is determined in step S18 shown FIG. 5 that the output hash value has exceeded the maximum hash function value. The auxiliary information list corresponding to FIG. 9 indicates [0,0,1], [1,1,2], [2,2,2], and [3,0,1].
The hash process according to the second embodiment of the group-by processing system is described above by explaining the last half of the process according to the first embodiment of the group-by processing system, that is, the group-by operation using a result of the hash process. FIG. 10 shows the hash process according to the second embodiment. When it is compared with FIG. 3 showing the first embodiment of the present invention, it is different in that an output block counter 20 is provided in the processing unit 12. In the first embodiment, an auxiliary information record includes a block number, a minimum hash function value and a maximum hash function value of a record in the block. In the second embodiment, an auxiliary information record includes a block number, a minimum hash function value of a record in the block, and a generation order number of the auxiliary information record, that is, a generation order number of an output block. The contents and the sorting of auxiliary information records are described later.
The flowchart of the hash process according to the second embodiment is almost the same as FIGS. 4 through 7, but a part of the output buffer output process in FIG. 7 is different as shown in FIG. 11. In FIG. 11, an auxiliary information record is generated in step S36 with a set of a hash function value of the leading record of the hashed list output b |