Access augmentation or optimizing

Methods of organizing data and processing queries in a database system, and database system and software product for implementing such methods

6711563

Abstract

A reference table has columns associated with data attributes and rows containing related words assigned to those attributes in a collection of data, those words coming from different data tables having independent numbers of records. The stored data include word thesauruses associated with the attributes, and reference table row identifier lists respectively associated with thesaurus entries. Each word thesaurus associated with an attribute has a respective entry for each word assigned to this data attribute in the collection of data. The reference table, which may be a virtual table, defines a unified algebraic framework for the entries of all the thesauruses. Query criteria can be examined with reference to the relevant thesauruses to obtain a row-ID list or bitmap vector which represents all the reference table rows matching the query criteria, if any. The results can then be delivered through the original data tables or, preferably, by means of the thesauruses.


Claims

What is claimed is:

1. A method of organizing information in a database system, wherein a group of attributes is defined and words of a collection of data are assigned to said attributes, wherein the group of attributes is divided into a plurality of sub-groups each associated with a respective data table, each data table having a column for each attribute of the associated sub-group and rows for containing data table records comprising at least one word assigned to an attribute of the associated sub-group, wherein links are defined between the data tables records, each link having a target table and a corresponding source table having a link column containing link values each designating a record of said target table, whereby each of said link values represents a link between the record of the source table including said link value and the record of the target table designated by said link value, the method comprising the steps of:

allocating respective identifiers to data graphs, wherein each data graph represents related attribute values respectively assigned to the attributes of said group, wherein each attribute value of a data graph is either a default value or a word of said collection of data, and wherein the words of each data graph are from linked data table records;

storing a plurality of word thesauruses respectively associated with attributes of said group, wherein for each word assigned at least once to an attribute in the collection of data, the word thesaurus associated with said attribute has a respective entry containing said word; and

storing data representing data graph identifier lists respectively associated with the word thesaurus entries, wherein the data graph identifier list associated with a thesaurus entry relating to a word assigned to an attribute includes any identifier allocated to a data graph having said word assigned to said attribute.

2. A method according to claim 1, wherein the entries of each word thesaurus associated with an attribute of said group are sorted based on the words assigned to said attribute.

3. A method according to claim 1, further comprising the step of storing a link table having a plurality of rows respectively associated with the data graphs and a plurality of columns respectively associated with the attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the data graph associated with said row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or a link value for retrieving at least one stored word of the collection of data represented in the data graph associated with said row and assigned to an attribute of the sub-group associated with said one of the columns.

4. A method according to claim 3, wherein each link value contained in the column of the link table associated with an attribute sub-group comprises data for identifying a row of the data table associated with said sub-group.

5. A method according to claim 3, wherein the link table has at least one column associated with a sub-group consisting of one attribute, each link value contained in said column being a word assigned to said one attribute in the collection of data.

6. A method according to claim 3, wherein the link table has at least one column associated with a sub-group consisting of one attribute, each link value contained in said column being a pointer to an entry of the word thesaurus associated with said one attribute.

7. A method according to claim 1, wherein each word thesaurus associated with an attribute of the group to which the default value is assigned in at least one of the data graphs further has an entry for the default value, whereby one of said data graph identifier lists is associated with said thesaurus entry for the default value and includes any identifier allocated to a data graph having said default value assigned to said attribute.

8. A method according to claim 7, wherein the step of allocating identifiers to the data graphs comprises:

for any target table record which is not designated in the link column of the corresponding source table, inserting a dummy record in an additional row of said source table, having a link value designating said target table row in the link column and a default value in each attribute column, such insertion of dummy records being repeated until any record of any target table is designated in the link column of the corresponding source table;

initializing the data representing data graph identifier lists, to represent empty lists; and

processing the records of at least one root table which is not the target table for any link, the processing of each root table record comprising:

constructing a respective data graph consisting of related attribute values of said root table record and of corresponding records of the other data tables successively determined on the basis of the contents of the link columns in said records;

allocating an available identifier to said data graph; and

for each attribute, selecting the entry of the associated word thesaurus relating to the value assigned to said attribute in said data graph and updating the data representing the data graph identifier list associated with the selected entry to add the identifier allocated to said data graph to said list.

9. A method according to claim 8, wherein the processing of each root table record further comprises:

generating a row in a link table having a plurality of columns respectively associated with the data tables, wherein said row of the link table contains, in each column associated with a data table, either a value indicating that the data graph has been constructed from a dummy record with respect to said data table or a link value designating a row of said data table where the attribute values represented in the data graph are stored, said link table being used for retrieving from the data tables words of the collection of data represented in the data graphs.

10. A method according to claim 9, further comprising the step of deleting the dummy records from the data tables.

11. A method according to claim 9, further comprising the step of deleting the link columns from the data tables.

12. A method according to claim 1, further comprising the step of storing at least one macroword thesaurus associated with an attribute of said group and with a prefix length shorter than a word length of said attribute, having a respective entry for each prefix value having said prefix length and matching a corresponding prefix of at least one word assigned at least once to said attribute in the collection of data, said entry of the macroword thesaurus being associated with a data graph identifier list including each identifier allocated to a data graph having, assigned to said attribute, a word whose corresponding prefix matches said prefix value.

13. A method according to claim 12, wherein at least one attribute has a plurality of stored macroword thesauruses, associated with different prefix lengths.

14. A method according to claim 12, wherein the entries of each macroword thesaurus associated with an attribute are sorted based on the prefix values, and the entries of the word thesaurus associated with said attribute are sorted based on the words assigned to said attribute.

15. A method according to claim 14, wherein at least one attribute has Q stored macroword thesauruses associated with respective prefix lengths shorter than a word length of said attribute, Q being a positive integer, each of said Q macroword thesauruses having a thesaurus level parameter q such that 1.ltoreq.q.ltoreq.Q, the prefix length being a decreasing function of the level parameter q, the word thesaurus associated with said attribute having a level parameter q=0, wherein each macroword thesaurus having a level parameter q.gtoreq.1 contains, in each entry provided for a level q prefix value, data designating the entry of the level q-1 thesaurus which corresponds to the lowest or highest level q-1 macroword or word whose corresponding prefix matches said level q prefix value.

16. A method according to claim 1, wherein at least one thesaurus entry comprises data for pointing to at least one memory region where coding data representing the data graph identifier list associated with said entry are stored.

17. A method according to claim 16, wherein said memory region comprises a file individually allocated to said thesaurus entry.

18. A method according to claim 17, wherein said data for pointing to at least one memory region comprise the attribute value for which said thesaurus entry is provided, said attribute value being part of a respective file name used for accessing each file allocated to said thesaurus entry.

19. A method according to claim 16, wherein said memory region comprises a portion of a data container having a plurality of records each having a next address field, said portion being defined as a record chain in said data container, the chains being defined by means of the next address fields, and wherein said data for pointing to at least one memory region comprise a respective address of a first record of a chain in said data container.

20. A method according to claim 19, further comprising the step of grouping the records stored in the data container so that the records of each chain have contiguous addresses.

21. A method according to claim 19, wherein said data container is individually allocated to a thesaurus.

22. A method according to claim 19, wherein said data container is shared by a plurality of thesauruses.

23. A method according to claim 19, wherein the thesaurus associated with an attribute of said group has an index register where the thesaurus entries are sorted based on the words assigned to said attribute, each entry including a word index for pointing to a row of an auxiliary table, and wherein each row of the auxiliary table contains an address in the data container of a first record of a chain.

24. A method according to claim 19, wherein said data for pointing to at least one memory region comprise a respective address of a last record of a chain in said data container.

25. A method according to claim 16, wherein said coding data representing a data graph identifier list comprises integers respectively equal to the identifiers of said list.

26. A method according to claim 1, wherein a plurality of formats are provided for representing the data graph identifier lists, and wherein each thesaurus entry contains an indication of the format used for representing the data graph identifier list associated therewith.

27. A method according to claim 1, wherein said data representing data graph identifier lists comprise, for at least one thesaurus entry, coding data obtained by a coding scheme having n successive coding layers, n being a number at least equal to 1, each layer having a predetermined patter for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer, and wherein said coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

28. A method according to claim 27, wherein the coding data are stored in a plurality of files including files respectively allocated to thesaurus entries.

29. A method according to claim 27, wherein the coding data are stored in a plurality of files including at least one file allocated to a respective thesaurus, for containing the coding data relating to the entries of said thesaurus.

30. A method according to claim 27, wherein the coding data are stored in at least one file allocated to a plurality of thesauruses, for containing the coding data relating to the entries of said plurality of thesauruses.

31. A method according to claim 27, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

32. A method according to claim 31, wherein said number of integers is a whole power of 2 for each layer.

33. A method according to claim 27, wherein the data representing the position of each integer of the input list of each layer within a subset consist of a bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to the input list of said layer.

34. A method according to claim 27, wherein the position of each subset in the layer n pattern is represented by an integer rank which is included in the coding data, in association with the data representing the position of the integers of the input list of layer n within said subset.

35. A method according to claim 27, wherein the position of a subset in the pattern of each layer is represented by an integer rank which is included in the coding data for said layer, in association with the data representing the position of the integers of the input list of said layer within said subset.

36. A method according to claim 27, wherein n.gtoreq.2 and layer k data containers each having a plurality of records are provided in a computer memory for 1.ltoreq.k.ltoreq.n, each record of a layer k data container being associated with a layer k integer rank representing the position of a subset in the layer k pattern, and wherein each record of a layer k data container associated with a layer k rank representing the position of a subset in the layer k pattern has a first field for containing data for retrieving the position within said subset of any integer of a layer k input list relating to a data graph identifier list, whereby a combination of said layer k rank with any position retrievable from the data contained in said first field determines a layer k-1 rank with which a respective record of the layer k-1 data container is associated if k>1, and an identifier of said data graph identifier list if k=1.

37. A method according to claim 36, wherein, for 1.ltoreq.k.ltoreq.n, said data contained in the first field of a record of the layer k data container for retrieving the position of any integer of a layer k input list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer belongs to said layer k input list.

38. A method according to claim 37, wherein, for 1.ltoreq.k.ltoreq.n, each record of the layer k data container associated with a layer k rank further has a second field for containing said layer k rank.

39. A method according to claim 38, wherein each data container comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.

40. A method according to claim 36, wherein, for 1.ltoreq.k<n, each record of the layer k data container further has a second field for containing a number representing the position of an integer of a layer k+1 input list within a subset of the layer k+1 pattern,

and wherein, for 1<k.ltoreq.n, said data contained in the first field of a record of the layer k data container associated with a layer k rank for retrieving the position of any integer of a layer k input list within a subset of the layer k pattern comprise a pointer to at least one record of the layer k-1 data container in which the second field contains a number representing the position of an integer of said layer k input list within said subset of the layer k pattern, whereby said record of the layer k-1 data container is associated with the layer k-1 rank determined by the combination of said layer k rank with the position represented by said number.

41. A method according to claim 40, wherein said data contained in the first field of a record of the layer 1 data container for retrieving the position of any integer of a data graph identifier list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer represents a data graph identifier of said list.

42. A method according to claim 40, wherein each record of the layer n data container associated with a layer n rank further has a second field for containing said layer n rank.

43. A method according to claim 40, wherein each layer k data container for 1<k<n comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.

44. A method according to claim 36, wherein, for 1.ltoreq.k.ltoreq.n, each record of the layer k data container further has a next address field, whereby record chains are defined in the layer k data container by means of the next address fields, and wherein at least some of the thesaurus entries are respectively associated with record chains in the layer n data container, whereby the coding data relating to one of said entries for layer n are stored in or retrievable from the record chain associated therewith in the layer n data container.

45. A method according to claim 44, wherein, for 1.ltoreq.k<n, said thesaurus entries are respectively associated with record chains in the layer k data container, whereby the coding data relating to one of said entries for layer k are stored in or retrievable from the record chain associated therewith in the layer k data container.

46. A method according to claim 44, wherein, for 1<k.ltoreq.n, each record of the layer k data container further has a head address field for pointing to an address of a first record of a respective chain in the layer k-1 data container.

47. A method according to claim 44, wherein each layer k data container for 1.ltoreq.k.ltoreq.n comprises at least two files where the first fields and the next address fields of the records of said data container are respectively stored, said files being accessible separately.

48. A method according to claim 1, wherein an integer range covering the identifiers allocated to the data graphs is partitioned into a plurality of predetermined portions, wherein at least some of the data representing identifier lists are distributed into a plurality of storage sections respectively associated with said portions, wherein a storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion.

49. A method according to claim 48, wherein a respective storage unit is provided for each of said portions of the data graph identifier range, to receive the storage sections associated with said portion.

50. A method according to claim 49, wherein at least some of the word thesauruses have a plurality of sections respectively associated with said portions, wherein a section, associated with one of said portions, of a word thesaurus associated with an attribute has a respective entry for each word assigned to said attribute in a data graph to which an identifier of said portion is allocated, said entry containing data for retrieving an identifier sub-list from the storage section associated with said portion.

51. A method according to claim 49, further comprising the step of storing at least one macroword thesaurus associated with an attribute of said group and with a prefix length shorter than a word length of said attribute, having a respective entry for each prefix value having said prefix length and matching a corresponding prefix of at least one word assigned at least once to said attribute in the collection of data, said entry of the macroword thesaurus being associated with a data graph identifier list including each identifier allocated to a data graph having, assigned to said attribute, a word whose corresponding prefix matches said prefix value, wherein at least some of the macroword thesauruses have a plurality of sections respectively associated with said portions, wherein a section, associated with one of said portions, of a macroword thesaurus associated with an attribute and with a prefix length has a respective entry for each prefix value having said prefix length and matching a corresponding prefix of at least one word assigned to said attribute in a data graph to which an identifier of said portion is allocated, said entry containing data for retrieving an identifier sub-list from the storage section associated with said portion.

52. A method according to claim 48, wherein each thesaurus entry has a plurality of fields respectively associated with said portions, for containing data for retrieving respective identifier sub-lists from the storage sections.

53. A method of processing a query in a database system, wherein a group of attributes is defined and words of a collection of data are assigned to said attributes, the group of attributes being divided into a plurality of sub-groups respectively associated with a plurality of data tables having independent numbers of records, with links between respective records from the data tables, wherein identifiers are respectively allocated to data graphs, each data graph representing related attribute values respectively assigned to the attributes of said group, each attribute value of a data graph being either a default value or a word of said collection of data, wherein a plurality of thesauruses each associated with a respective attribute of said group and data representing data graph identifier lists respectively associated with entries of said thesauruses are stored, wherein each thesaurus associated with one attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said one attribute and has a respective entry for each subset including at least one word assigned to said one attribute in the collection of data, the data graph identifier list associated with said thesaurus entry including any identifier allocated to a data graph having a word of said subset assigned to said one attribute, the method comprising the steps of:

analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;

determining a matching data graph identifier list based on said combination and on the stored data representing the data graph identifier lists associated with said relevant thesaurus entries;

processing said matching data graph identifier list to output a response.

54. A method according to claim 53, wherein the step of analyzing the query criteria comprises, for at least one attribute referred to in said criteria:

selecting at least one range of words defined for said attribute in the query criteria; and

mapping the words of the selected range which are assigned to said attribute in the collection of data with one or more subsets, the thesaurus entry for each of said one or more subset being retained as a relevant entry for the selected range,

and wherein the step of determining the matching data graph identifier list comprises merging respective portions of the identifier lists represented by the data of the relevant thesaurus entries retained for said selected range.

55. A method according to claim 54, wherein the mapping is performed so as to retain a minimum number of relevant thesaurus entries for each selected range.

56. A method according to claim 54, wherein each thesaurus associated with an attribute is defined with reference to a partition such that each subset consists of one word or of consecutive words of the set of words which can be assigned to said attribute, the entries of said thesaurus being sorted based on the words assigned to said attribute, and wherein the step of analyzing the query criteria comprises at least one dichotomy search in at least one thesaurus for identifying relevant thesaurus entries.

57. A method according to claim 56, wherein the thesauruses comprise word thesauruses each associated with a respective attribute of the group, with reference to a partition into subsets each consisting of one word.

58. A method according to claim 57, wherein each word thesaurus associated with an attribute of the group to which the default value is assigned in at least one of the data graphs further has an entry for the default value, whereby one of said data graph identifier lists is associated with said thesaurus entry for the default value and includes any identifier allocated to a data graph having said default value assigned to said attribute.

59. A method according to claim 57, wherein the thesauruses further comprise at least one macroword thesaurus associated with an attribute of said group and with a prefix length shorter than a word length of said attribute, whereby said macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length.

60. A method according to claim 59, wherein at least one attribute of said group has a plurality of macroword thesauruses, associated with different prefix lengths.

61. A method according to claim 54, wherein the step of analyzing the query criteria comprises determining said combination involving relevant thesaurus entries as a tree having at least one leaf node, each leaf node corresponding to at least one relevant thesaurus entry retained for a respective attribute.

62. A method according to claim 61, wherein said tree has a plurality of nodes including said at least one leaf node and at least one operator node, each operator node representing a Boolean operator applied to at least one partial criterion represented by another node of said tree, one of the operator nodes being a root node representing all the query criteria.

63. A method according to claim 62, wherein the nodes of said tree further include at least one preset node for which a data graph identifier list has been determined prior to said step of analyzing the query criteria.

64. A method according to claim 63, wherein the data graph identifier list of said preset node is determined from at least one matching data graph identifier list obtained when processing a previous query.

65. A method according to claim 62, wherein the step of determining the matching data graph identifier list comprises obtaining a respective identifier list for each node of said tree, whereby the identifier list obtained for each leaf node corresponding to at least one relevant thesaurus entry is the merger of respective portions of the identifier lists associated with said at least one relevant thesaurus entry, and the identifier list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the identifier lists obtained for the node representing said at least one partial criterion, said matching data graph identifier list being determined as the identifier list obtained for the root node.

66. A method according to claim 65, wherein each of said obtained identifier lists is produced in the form of a bitmap vector consisting of bits assigned to respective data graphs to indicate whether the identifiers allocated to said data graphs belong to said obtained list.

67. A method according to claim 65, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a thesaurus entry, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the patter of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

and wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

68. A method according to claims 67, wherein the patter of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

69. A method according to claim 68, wherein said number of integers is a whole power of 2 for each layer.

70. A method according to claim 67, wherein said data representing the position of an integer of an input list within a subset consist of a bitmap segment.

71. A method according to claim 67, wherein the step of determining the matching data graph identifier list comprises determining a layer n integer list for each node of said tree, whereby the layer n integer list determined for a leaf node consists of a layer n input list associated, in the coding scheme, with the merger of the identifier lists represented in the relevant thesaurus entries to which said leaf node corresponds, and whereby the layer n integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer n integer lists determined for the nodes representing said at least one partial criterion, and wherein a layer n result list is determined as the layer n integer list obtained for the root node.

72. A method according to claim 71, wherein the nodes of said tree further include at least one preset node for which a data graph identifier list has been determined prior to said step of analyzing the query criteria, said data graph identifier list being subjected to the coding scheme to provide a layer n input list which is determined as said layer n integer list for said preset node.

73. A method according to claim 71, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for the coding layer n define a layer n bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the data representing the position of said subset in the layer n pattern comprise a layer n integer rank associated with said layer n bitmap segment, and wherein the step of determining a layer n integer list for a leaf node comprises:

initializing a layer n bitmap vector with logical zeroes;

obtaining the layer n ranks and associated bitmap segments from the coding data for each relevant thesaurus entry to which said leaf node corresponds; and

for each of said layer n ranks, superimposing the layer n bitmap segment associated therewith onto a segment of said layer n bitmap vector having a position determined by said layer n rank, the superimposition being performed according to a bitwise Boolean OR operation,

said layer n list for the leaf node corresponding to the resulting layer n bitmap vector.

74. A method according to claim 71, wherein n>1 and the step of determining the matching data graph identifier list further comprises, for k decreasing from n-1 to 1, determining a layer k integer list for each node of said tree, whereby the layer k integer list determined for a leaf node consists of any integer of a layer k input list, associated in the coding scheme with the identifier list represented in a relevant thesaurus entry to which said leaf node corresponds, which belongs to a layer k subset whose position is represented in the layer k+1 result list, and whereby the layer k integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer k integer lists determined for the nodes representing said at least one partial criterion, wherein a layer k result list is determined as the layer k integer list obtained for the root node, and wherein said matching data graph identifier list corresponds to the determined layer 1 result list.

75. A method according to claim 74, wherein the nodes of said tree further include at least one preset node for which a data graph identifier list has been determined prior to said step of analyzing the query criteria, said data graph identifier list being subjected to the coding scheme to provide a layer k input list which is determined as said layer k integer list for said preset node.

76. A method according to claim 74, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the coding data further comprise a layer k integer rank associated with said layer k bitmap segment to represent the position of said subset in the layer k pattern, and wherein the step of determining a layer k integer list for a leaf node comprises:

initializing a layer k bitmap vector with logical zeroes;

obtaining the layer k ranks from the coding data for each relevant thesaurus entry to which said leaf node corresponds; and

selecting any obtained layer k rank belonging to the layer k+1 result list and superimposing the associated layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by the selected layer k rank, the superimposition being performed according to a bitwise Boolean OR operation,

said layer k list for the leaf node corresponding to the resulting layer k bitmap vector.

77. A method according to claim 76, wherein, for 1.ltoreq.k<n, the layer k ranks and the layer k bitmap segments associated therewith for at least one thesaurus entry are stored at corresponding addresses in distinct first and second files, and wherein the step of determining a layer k integer list for a leaf node comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said first and second files;

filling the rank table by writing any selected layer k rank into the rank table record associated with the address of the selected layer k rank in said first file; and

for any record of the filled rank table containing a layer k rank and associated with an address in the second file, reading the associated layer k bitmap segment at said address in the second file and superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.

78. A method according to claim 74, wherein the step of determining the matching data graph identifier list further comprises, for any coding layer k such that 1<k.ltoreq.n, determining a layer k' filtering list for k.ltoreq.k'.ltoreq.n consisting of the layer k' input list obtained by providing the layer k result list as an input list in layer k of the coding scheme,

wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while a layer k integer rank associated with said layer k bitmap segment represents the position of said subset in the layer k pattern, and wherein the step of determining a layer k integer list for a leaf node for k<n comprises:

/a/ initializing a layer k bitmap vector with logical zeroes;

/b/ selecting the layer n ranks obtained from the coding data for each relevant thesaurus entry to which said leaf node corresponds, and setting k'=n;

/c/ for each selected layer k' rank:

/c1/ if the selected layer k' rank represents the position in the layer k' pattern of a subset which includes at least one integer of the layer k' filtering list, obtaining the layer k' bitmap segment with which the selected layer k' rank is associated;

/c2/ for any integer of the layer k' filtering list whose position within said subset is represented in said layer k' bitmap segment, selecting a respective layer k'-1 rank determined from the selected layer k' rank and said position represented in said layer k' bitmap segment;

/c3/ if k'>k+1, executing step /c/ with k' decremented by one unit; and

/c4/ if k'-1 =k, obtaining any layer k bitmap segment with which a selected layer k'-1 rank is associated, and superimposing said layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said selected layer k'-1 rank, the superimposition being performed according to a bitwise Boolean OR operation,

said layer k list for the leaf node corresponding to the resulting layer k bitmap vector.

79. A method according to claim 78, wherein, for 1.ltoreq.k<n, the layer k bitmap segments for at least one thesaurus entry are stored in at least one layer k file at addresses respectively corresponding to the layer k ranks associated therewith, and wherein, for 1.ltoreq.k<n, the step of determining a layer k integer list for a leaf node comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said layer k file;

filling the rank table by writing any selected layer k rank into the rank table record associated with the address corresponding to the selected layer k rank; and

for any record of the filled rank table containing a layer k rank and associated with an address in said layer k file, reading the associated layer k bitmap segment at said address and superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.

80. A method according to claim 53, wherein a link table is stored, having a plurality of rows respectively associated with the data graphs and a plurality of columns respectively associated with the attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the data graph associated with said row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or a link value for retrieving at least one stored word of the collection of data represented in the data graph associated with said row and assigned to an attribute of the sub-group associated with said one of the columns, and wherein the step of processing the matching data graph identifier list comprises reading at least one value in any row of the link table associated with a data graph identified in the matching data graph identifier list.

81. A method according to claim 80, wherein said data tables are stored, wherein each link value contained in the column of the link table associated with an attribute sub-group comprises data for identifying a record of the data table associated with said sub-group, and wherein the step of processing the matching data graph identifier list further comprises reading at least part of any data table record identified by a link value read in a row of the link table.

82. A method according to claim 53, wherein the step of processing the matching data graph identifier list comprises, for at least one attribute specified in the query, selecting a thesaurus associated with said attribute and detecting entries of the selected thesaurus with which identifier lists having a non-empty intersection with the matching data graph identifier list are associated.

83. A method according to claim 82, wherein said attribute specified in the query has Q+1 stored thesauruses associated with different prefix lengths, Q being an integer at least equal to 0, each of said Q+1 thesauruses having a thesaurus level parameter q such that 0.ltoreq.q.ltoreq.Q, whereby the prefix length is a decreasing function of the level parameter q and corresponds to a word length of said attribute for q=0, wherein each of said Q+1 thesauruses is defined with reference to a respective partition into subsets each consisting of words beginning by a common prefix having the prefix length associated with said thesaurus, the entries of said thesaurus being sorted based on the prefix values.

84. A method according to claim 83, wherein, the selected thesaurus having a level parameter QA.gtoreq.0, the detection of entries in the selected thesaurus comprises the steps of:

/a/ providing respective level q target lists and respective level q thesaurus ranges covering consecutive entries of the level q thesaurus for QA.ltoreq.q.ltoreq.Q;

/b/ initializing the level Q target list with the matching data graph identifier list, initializing the level parameter q with the value Q, and selecting a first entry of the level Q thesaurus range;

/c/ determining an intersection list between the level q target list and the identifier list associated with the selected entry of the level q thesaurus range;

/d/ if the intersection list determined in the preceding step /c/ is empty, selecting another entry of the level q thesaurus range and repeating step /c/;

/e/ if q is greater than QA:

/e1/ setting the level q-1 target list as equal to the intersection list determined in the preceding step /c/;

/e2/ setting the level q-1 thesaurus range as consisting of the entries of the level q-1 thesaurus relating to level q-1 prefixes which begin with the level q prefix of the selected level q thesaurus entry, and selecting a first entry of the level q-1 thesaurus range;

/e3/ decrementing q by one unit and returning to step /c/;

/f/ if q is equal to QA:

/f1/ including the selected level QA thesaurus entry in the detected entries;

/f2/ if the level Q target list is equal to the intersection list determined in the preceding step /c/, terminating the detection of entries in the selected thesaurus;

/f3/ removing the integers of the intersection list determined in the preceding step /c/ from any target list including at least one integer which is not in said intersection list;

/f4/ setting q as the smallest level parameter for which the target list includes at least one integer which is not in said intersection list;

/f5/ selecting another entry in the level q thesaurus range and returning to step /c/.

85. A method according to claim 84, wherein Q.gtoreq.1 and each thesaurus having a level parameter q.gtoreq.1 further contains, in each entry provided for a level q prefix value, data designating the entry of the level q-1 thesaurus which corresponds to the lowest or highest level q-1 prefix beginning with the level q prefix of said level q thesaurus entry, and wherein step /e2/ comprises selecting the level q-1 thesaurus entry designated in the selected level q thesaurus entry.

86. A method according to claim 84, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a level q thesaurus entry for 0.ltoreq.q.ltoreq.Q, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer,

and wherein each level q target list forms a layer 1 and level q filtering list and is submitted as a layer 1 input list in the coding scheme for QA.ltoreq.q.ltoreq.Q to provide respective layer k and level q filtering lists for 1<k.ltoreq.n if n>1, said layer k and level q filtering list provided from a level q target list being the layer k input list obtained from said level q target list in the coding scheme.

87. A method according to claim 86, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

88. A method according to claim 87, wherein said number of integers is a whole power of 2 for each layer.

89. A method according to claim 86, wherein the step /c/ of determining the intersection list between a level q target list and an identifier list comprises, from k=n:

/c1/ computing a layer k intersection list between the layer k input list obtained from said identifier list in the coding scheme and the layer k and level q filtering list corresponding to said level q target list;

/c2/ if the computed layer k intersection list is empty, determining said intersection list between the level q target list and the identifier list as being empty;

/c3/ if k=1, determining said intersection list between the level q target list and the identifier list as the computed layer 1 intersection list; and

/c4/ if k>1, decrementing k by one unit and repeating from step /c1/.

90. A method according to claim 89, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k.ltoreq.n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the data representing the position of said subset in the layer k pattern comprise a layer k integer rank associated with said layer k bitmap segment, and wherein the step /c1/ of computing a layer k intersection list between a layer k input list obtained from an identifier list in the coding scheme and a layer k and level q filtering list, represented by a first layer k bitmap vector, comprises:

initializing a second layer k bitmap vector with logical zeroes;

obtaining layer k ranks from the coding data representing said identifier list; and

selecting any obtained layer k rank which represents the position in the layer k pattern of a subset including at least one integer of said layer k and level q filtering list, obtaining the layer k bitmap segment with which the selected layer k rank is associated, and determining a segment of the second layer k bitmap vector having a position determined by the selected layer k rank by combining the obtained layer k bitmap segment with a segment of the first layer k bitmap vector having a position determined by the selected layer k rank according to a bitwise Boolean AND operation,

said layer k intersection list corresponding to the resulting second layer k bitmap vector.

91. A method according to claim 90, wherein, for 1.ltoreq.k<n, the layer k ranks and the layer k bitmap segments associated therewith for at least one thesaurus entry are stored at corresponding addresses in distinct first and second files, and wherein the step /c1/ of computing a layer k intersection list between a layer k input list obtained from an identifier list in the coding scheme and a layer k and level q filtering list, represented by a first layer k bitmap vector, comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said first and second files;

filling the rank table by writing any selected layer k rank into the rank table record associated with the address of said selected layer k rank in said first file; and

for any record of the filled rank table containing a layer k rank and associated with an address in the second file, reading the associated layer k bitmap segment at said address in the second file and combining the read layer k bitmap segment with a segment of the first layer k bitmap vector having a position determined by said layer k rank according to a bitwise Boolean AND operation to determine a segment of the second layer k bitmap vector having a position determined by said layer k rank.

92. A method according to claim 90, further comprising determining a pre-filtering flag for each entry of a level q thesaurus, said pre-filtering flag having a first value when said entry is associated with a data graph identifier list represented by coding data which do not define any layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of a layer n and level q filtering list, and wherein the step /c/ of determining the intersection list between a level q target list, corresponding to said layer n and level q filtering list, and an identifier list associated with an entry of the level q thesaurus comprises determining said intersection list as being empty if the pre-filtering flag determined for said entry has said first value.

93. A method according to claim 92, wherein, for any level q thesaurus entry associated with a data graph identifier list represented by coding data which define a layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of the layer n and level q filtering list, the layer n bitmap segment associated with said layer n rank is obtained and said first value is allocated to the pre-filtering flag determined for said entry if the obtained layer n bitmap segment does not represent the position of any integer of said layer n and level q filtering list within said subset.

94. A method according to claim 86, wherein n>1 and in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k.ltoreq.n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while a layer k integer rank associated with said layer k bitmap segment represents the position of said subset in the layer k pattern, and wherein the step /c/ of determining the intersection list between a level q target list, corresponding to layer k and level q filtering lists represented by a respective first layer k bitmap vectors for 1.ltoreq.k.ltoreq.n, and an identifier list comprises:

/c1/ initializing a second bitmap vector with logical zeroes;

/c2/ selecting layer n ranks obtained from the coding data representing said identifier list, and setting k=n;

/c3/ for each selected layer k rank:

/c31/ if the selected layer k rank represents the position in the layer k pattern of a subset which includes at least one integer of said layer k and level q filtering list, obtaining the layer k bitmap segment with which the selected layer k rank is associated;

/c32/ for any integer of the layer k and level q filtering list whose position within said subset is represented in said layer k bitmap segment, selecting a respective layer k-1 rank determined from the selected layer k rank and said position represented in said layer k bitmap segment;

/c33/ if k>2, executing step /c3/ with k decremented by one unit; and

/c34/ if k=2, obtaining any layer 1 bitmap segment with which a selected layer 1 rank is associated, and combining the obtained layer 1 bitmap segment with a segment of the first layer 1 bitmap vector having a position determined by said layer 1 rank according to a bitwise Boolean AND operation to determine a segment of the second bitmap vector having a position determined by said layer 1 rank,

said intersection list corresponding to the resulting second bitmap vector.

95. A method according to claim 94, wherein the layer 1 bitmap segments for at least one thesaurus entry are stored in at least one layer 1 file at addresses respectively corresponding to the layer 1 ranks associated therewith, and the step /c/ of determining an intersection list comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said layer 1 file;

filling the rank table by writing any layer 1 rank selected in step /c32/ into the rank table record associated with the address corresponding to the selected layer 1 rank; and

for any record of the filled rank table containing a layer 1 rank and associated with an address in said layer 1 file, reading the associated layer 1 bitmap segment at said address and combining the read layer 1 bitmap segment with a segment of the first layer 1 bitmap vector having a position determined by said layer 1 rank according to a bitwise Boolean AND operation to determine a segment of the second layer 1 bitmap vector having a position determined by said layer 1 rank.

96. A method according to claim 94, further comprising determining a pre-filtering flag for each entry of a level q thesaurus, said pre-filtering flag having a first value when said entry is associated with a data graph identifier list represented by coding data which do not define any layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of a layer n and level q filtering list, and wherein the step /c/ of determining the intersection list between a level q target list, corresponding to said layer n and level q filtering list, and an identifier list associated with an entry of the level q thesaurus comprises determining said intersection list as being empty if the pre-filtering flag determined for said entry has said first value.

97. A method according to claim 96, wherein, for any level q thesaurus entry associated with a data graph identifier list represented by coding data which define a layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of the layer n and level q filtering list, the layer n bitmap segment associated with said layer n rank is obtained and said first value is allocated to the pre-filtering flag determined for said entry if the obtained layer n bitmap segment does not represent the position of any integer of said layer n and level q filtering list within said subset.

98. A method according to claim 82, wherein the step of processing the matching data graph identifier list further comprises writing output data associated with any detected entry of a selected thesaurus into an output table.

99. A method according to claim 98, wherein the output table includes a respective row corresponding to each identifier of the matching data graph identifier list, and wherein output data associated with a detected entry of a selected thesaurus are written into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected thesaurus entry.

100. A method according to claim 99, wherein each data graph identifier has a respective row of the output table corresponding thereto, wherein the rows of the output table are initialized with a default value before writing the output data, and wherein the rows of the output table which do not contain the default value are read after writing the output data.

101. A method according to claim 99, wherein the output table is associated with an index file having a respective record for each data graph identifier, containing either a default value or a pointer designating a respective row of the output table corresponding to said data graph identifier, wherein the records of the index file are initialized with a default value before writing the output data, and wherein the step of writing output data associated with a detected entry of a first selected thesaurus comprises, for each data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected entry of the first selected thesaurus:

allocating an available row of the output table to correspond to said data graph identifier,

writing output data into the allocated row; and

writing a pointer to the allocated row into the record of the index file provided for said data graph identifier.

102. A method according to claim 99, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, and wherein output data associated with a detected entry of a thesaurus selected for an attribute associated with a column of the output table are written into said column.

103. A method according to claim 101, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, wherein output data associated with a detected entry of a thesaurus selected for an attribute associated with a column of the output table are written into said column, and wherein the step of writing output data associated with a detected entry of at least one second selected thesaurus comprises, for each data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected entry of the second selected thesaurus:

reading the pointer contained in the record of the index file provided for said data graph identifier; and

writing output data into the row of the output table designated by said pointer.

104. A method according to claim 98, wherein the selected thesaurus being a word thesaurus defined with reference to a partition into subsets each consisting of one word, the output data associated with a detected entry comprise the word for which said detected entry is provided.

105. A method according to claim 98, wherein the selected thesaurus being a macroword thesaurus associated with a prefix length and defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the output data associated with a detected entry comprise the prefix value for which said detected entry is provided.

106. A method according to claim 98, wherein the output data associated with a detected entry comprise an address of said detected entry in the selected thesaurus.

107. A method according to claim 98, wherein the output data associated with a detected entry of a selected thesaurus comprise a numerical value derived from said thesaurus entry.

108. A method according to claim 107, wherein, for a detected entry of at least one selected thesaurus, said numerical value is calculated by applying a mathematical function to a thesaurus value stored in said entry.

109. A method according to claim 107, wherein, for a detected entry of at least one selected thesaurus, said numerical value is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in the output table.

110. A method according to claim 107, wherein the output table includes a respective row corresponding to each identifier of the matching data graph identifier list, and wherein a numerical value derived from a detected thesaurus entry is written into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected thesaurus entry.

111. A method according to claim 110, wherein the numerical value, derived from a detected entry of a first selected thesaurus and written into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said entry of the first selected thesaurus, is obtained from a thesaurus value stored in said entry,

and wherein the numerical value, derived from a detected entry of at least one second selected thesaurus and written into a row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said entry of the second selected thesaurus, is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in said row of the output table.

112. A method according to claim 110, further comprising calculating an output value from a set of numerical values which have been respectively written into the rows of the output table.

113. A method according to claim 53, wherein an integer range covering the identifiers allocated to the data graphs is partitioned into a plurality of predetermined portions, wherein at least some of the data representing identifier lists are distributed into a plurality of storage sections respectively associated with said portions, wherein a storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion,

and wherein the step of determining a matching data graph identifier list is executed separately for the different portions of the data graph identifier range, by means of the respective storage sections.

114. A method according to claim 113, wherein the step of processing the matching data graph identifier is at least partially executed separately for the different portions of the data graph identifier range, by means of the respective storage sections.

115. A method according to claim 113, wherein the thesauruses have a plurality of sections respectively associated with said portions, wherein a section, associated with one of said portions, of a thesaurus associated with an attribute and defined with reference to a partition into subsets has a respective entry for each subset of said partition which includes at least one word assigned to said attribute in a data graph to which an identifier of said portion is allocated, said entry containing data representing an identifier sub-list including each identifier of said portion allocated to a data graph having a word of said subset assigned to said attribute, and wherein the step of analyzing the query criteria is at least partially executed separately for the different portions of the data graph identifier range, by means of the respective thesaurus sections.

116. A method according to claim 113, wherein the separate step executions are carried out in parallel by respective processors for the different portions of the data graph identifier range.

117. A method according to claim 116, wherein each thesaurus entry has a plurality of fields respectively associated with said portions, for containing data for retrieving respective identifier sub-lists from the storage sections, wherein the step of analyzing the query criteria is executed centrally for all the portions of the data graph identifier range, and wherein the relevant thesaurus entries used by a processor executing the step of determining a matching data graph identifier list by means of a storage section are designated by the data for retrieving identifier sub-lists from said storage section.

118. A method according to claim 117, wherein the step of analyzing the query criteria is executed by a query server connected to said processors through a communication network.

119. A method according to claim 116, wherein a list update server is connected, through the communication network, to a plurality of storage units respectively coupled to said processors, the list update server controlling the storage units to maintain the storage sections.

120. A database system, for managing information from a plurality of data tables having different numbers of records, wherein the group of attributes is divided into a plurality of sub-groups respectively associated with the data tables, with links between respective records from the data tables, each data table having a column for each attribute of the associated sub-group and rows for containing data table records comprising at least one word assigned to an attribute of the associated sub-group in a collection of data, wherein respective identifiers are allocated to data graphs, each data graph representing related attribute values respectively assigned to the attributes of said group, each attribute value of a data graph being either a default value or a word of said collection of data, the words of each data graph being from linked data table records, the database system comprising:

means for storing a plurality of thesauruses respectively associated with attributes of said group, wherein each thesaurus associated with an attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said attribute and has a respective entry for each subset including at least one word assigned to said attribute in the collection of data; and

means for storing data representing data graph identifier lists respectively associated with the thesaurus entries, wherein the data graph identifier list associated with a entry, relating to a subset, of a thesaurus associated with an attribute includes any identifier allocated to a data graph having a word of said subset assigned to said attribute.

121. A database system according to claim 120, further comprising means for storing a link table having a plurality of rows respectively associated with the data graphs and a plurality of columns respectively associated with the attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the data graph associated with said row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or data for identifying a row of the data table associated with said sub-group.

122. A database system according to claim 120, wherein at least one thesaurus entry comprises data for pointing to at least one memory region where coding data representing the data graph identifier list associated with said entry are stored.

123. A database system according to claim 122, wherein said memory region comprises a file individually allocated to said thesaurus entry.

124. A database system according to claim 122, wherein said memory region comprises a portion of a data container having a plurality of records each having a next address field, said portion being defined as a record chain in said data container, the chains being defined by means of the next address fields, and wherein said data for pointing to at least one memory region comprise a respective address of a first record of a chain in said data container.

125. A database system according to claim 124, further comprising means for grouping the records stored in the data container so that the records of each chain have contiguous addresses.

126. A database system according to claim 120, wherein a plurality of formats are provided for representing the data graph identifier lists, and wherein each thesaurus entry contains an indication of the format used for representing the data graph identifier list associated therewith.

127. A database system according to claim 120, further comprising query processing means including:

means for analyzing query criteria to determine a combination involving thesaurus entries relevant to a query;

means for determining a matching data graph identifier list based on said combination and on the stored data representing the data graph identifier lists associated with said relevant thesaurus entries;

output means for processing said matching data graph identifier list to output a response.

128. A database system according to claim 127, wherein the means for analyzing the query criteria comprise:

means for selecting at least one range of words defined for one attribute in the query criteria; and

mapping means for mapping the words of the selected range which are assigned to said attribute in the collection of data with one or more subsets, and for retaining the thesaurus entry for each of said one or more subset as a relevant entry for the selected range,

and wherein the means for determining the matching data graph identifier list comprise means for merging respective portions of the identifier lists represented by the data of the relevant thesaurus entries retained for said selected range.

129. A database system according to claim 128, wherein the mapping means are arranged to retain a minimum number of relevant thesaurus entries or each selected range.

130. A database system according to claim 128, wherein each thesaurus associated with an attribute is defined with reference to a partition such that each subset consists of one word or of consecutive words of the set of words which can be assigned to said attribute, the entries of said thesaurus being sorted based on the words assigned to said attribute, and wherein the means for analyzing the query criteria comprise dichotomy search means in at least one thesaurus for identifying relevant thesaurus entries.

131. A database system according to claim 130, wherein the thesauruses comprise word thesauruses each associated with a respective attribute of the group, with reference to a partition into subsets each consisting of one word.

132. A database system according to claim 131, wherein each word thesaurus associated with an attribute of the group to which the default value is assigned in at least one of the data graphs further has an entry for the default value, whereby one of said data graph identifier lists is associated with said thesaurus entry for the default value and includes any identifier allocated to a data graph having said default value assigned to said attribute.

133. A database system according to claim 130, wherein the thesauruses further comprise at least one macroword thesaurus associated with an attribute of said group and with a prefix length shorter than a word length of said attribute, whereby said macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length.

134. A database system according to claim 128, wherein the means for analyzing the query criteria comprise means for determining said combination involving relevant thesaurus entries as a tree having at least one leaf node, each leaf node corresponding to at least one relevant thesaurus entry retained for a respective attribute.

135. A database system according to claim 134, wherein said tree has a plurality of nodes including said at least one leaf node and at least one operator node, each operator node representing a Boolean operator applied to at least one partial criterion represented by another node of said tree, one of the operator nodes being a root node representing all the query criteria.

136. A database system according to claim 135, wherein the means for determining the matching data graph identifier list comprise means for obtaining a respective identifier list for each node of said tree, whereby the identifier list obtained for each leaf node corresponding to at least one relevant thesaurus entry is the merger of respective portions of the identifier lists associated with said at least one relevant thesaurus entry, and the identifier list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the identifier lists obtained for the node representing said at least one partial criterion, said matching data graph identifier list being determined as the identifier list obtained for the root node.

137. A database system according to claim 136, wherein each of said obtained identifier lists is produced in the form of a bitmap vector consisting of bits assigned to respective data graphs to indicate whether the identifiers allocated to said data graphs belong to said obtained list.

138. A database system according to claim 136, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a thesaurus entry, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

and wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

139. A database system according to claims 138, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

140. A database system according to claim 139, wherein said number of integers is a whole power of 2 for each layer.

141. A database system according to claim 138, wherein the means for determining the matching data graph identifier list comprise means for determining a layer n integer list for each node of said tree, whereby the layer n integer list determined for a leaf node consists of a layer n input list associated, in the coding scheme, with the merger of the identifier lists represented in the relevant thesaurus entries to which said leaf node corresponds, and whereby the layer n integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer n integer lists determined for the nodes representing said at least one partial criterion, and wherein a layer n result list is determined as the layer n integer list obtained for the root node.

142. A database system according to claim 141, wherein the nodes of said tree further include at least one preset node for which a data graph identifier list has been determined in advance, said data graph identifier list being subjected to the coding scheme to provide a layer n input list which is determined as said layer n integer list for said preset node.

143. A database system according to claim 141, wherein n>1 and the means for determining the matching data graph identifier list further comprise, for k decreasing from n-1 to 1, means for determining a layer k integer list for each node of said tree, whereby the layer k integer list determined for a leaf node consists of any integer of a layer k input list, associated in the coding scheme with the identifier list represented in a relevant thesaurus entry to which said leaf node corresponds, which belongs to a layer k subset whose position is represented in the layer k+1 result list, and whereby the layer k integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer k integer lists determined for the nodes representing said at least one partial criterion, and means for determining a layer k result list as the layer k integer list obtained for the root node, and wherein said matching data graph identifier list corresponds to the determined layer 1 result list.

144. A database system according to claim 127, wherein the output means comprise means for selecting a thesaurus associated with an attribute specified in the query, and means for detecting entries of the selected thesaurus with which identifier lists having a non-empty intersection with the matching data graph identifier list are associated.

145. A database system according to claim 144, wherein the output means further comprise means for writing output data associated with any detected entry of a selected thesaurus into an output table.

146. A database system according to claim 145, wherein the output table includes a respective row corresponding to each identifier of the matching data graph identifier list, and wherein said means for writing output data are arranged to write output data associated with a detected entry of a selected thesaurus into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected thesaurus entry.

147. A database system according to claim 146, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, and wherein said means for writing output data are arranged to write into a column of the output table output data associated with a detected entry of a thesaurus selected for an attribute associated with said column.

148. A database system according to claim 145, wherein the selected thesaurus being a word thesaurus defined with reference to a partition into subsets each consisting of one word, the output data associated with a detected entry comprise the word for which said detected entry is provided.

149. A database system according to claim 145, wherein the selected thesaurus being a macroword thesaurus associated with a prefix length and defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the output data associated with a detected entry comprise the prefix value for which said detected entry is provided.

150. A database system according to claim 145, wherein the output data associated with a detected entry comprise an address of said detected entry in the selected thesaurus.

151. A database system according to claim 145, wherein the output data associated with a detected entry of a selected thesaurus comprise a numerical value derived from said thesaurus entry.

152. A database system according to claim 151, further comprising means for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a thesaurus value stored in said entry.

153. A database system according to claim 151, further comprising means for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in the output table.

154. A database system according to claim 151, wherein the output table includes a respective row corresponding to each identifier of the matching data graph identifier list, and wherein said means for writing output data are arranged to write a numerical value derived from a detected thesaurus entry into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected thesaurus entry.

155. A database system according to claim 154, wherein the numerical value, derived from a detected entry of a first selected thesaurus and written into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said entry of the first selected thesaurus, is obtained from a thesaurus value stored in said entry,

and wherein the numerical value, derived from a detected entry of at least one second selected thesaurus and written into a row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said entry of the second selected thesaurus, is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in said row of the output table.

156. A database system according to claim 154, further comprising means for calculating an output value from a set of numerical values which have been respectively written into the rows of the output table.

157. A database system according to claim 127, wherein the means for storing data representing identifier lists comprise a plurality of storage sections respectively associated with distinct portions of an integer range covering the identifiers allocated to the data graphs according to a predetermined partition, wherein each storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion,

and wherein the means for determining the matching data graph identifier list comprise means for determining respective matching identifier sub-lists included in the different portions of the data graph identifier range, by means of the respective storage sections.

158. A database system according to claim 157, comprising a processor to determine said matching identifier sub-lists sequentially.

159. A database system according to claim 157, comprising a plurality of processors respectively associated to the different portions of the data graph identifier range, to determine said matching identifier sub-lists in parallel.

160. A database system according to claim 159, wherein the means for analyzing query criteria comprise a query server connected to said processors through a communication network.

161. A database system according to claim 160, further comprising a list update server connected, through the communication network, to a plurality of storage units respectively coupled to said processors, the list update server controlling the storage units to maintain the storage sections.

162. A computer program product for managing information from a plurality of data tables having different numbers of records, wherein the group of attributes is divided into a plurality of sub-groups respectively associated with the data tables, with links between respective records from the data tables, each data table having a column for each attribute of the associated sub-group and rows for containing data table records comprising at least one word assigned to an attribute of the associated sub-group in a collection of data, wherein respective identifiers are allocated to data graphs, each data graph representing related attribute values respectively assigned to the attributes of said group, each attribute value of a data graph being either a default value or a word of said collection of data, the words of each data graph being from linked data table records, the computer program product comprising:

instructions for storing a plurality of thesauruses respectively associated with attributes of said group, wherein each thesaurus associated with an attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said attribute and has a respective entry for each subset including at least one word assigned to said attribute in the collection of data; and

instructions for storing data representing data graph identifier lists respectively associated with the thesaurus entries, wherein the data graph identifier list associated with a entry, relating to a subset, of a thesaurus associated with an attribute includes any identifier allocated to a data graph having a word of said subset assigned to said attribute.

163. A computer program product according to claim 162, further comprising instructions for storing a link table having a plurality of rows respectively associated with the data graphs and a plurality of columns respectively associated with the attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the data graph associated with said row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or data for identifying a row of the data table associated with said sub-group.

164. A computer program product according to claim 162, wherein at least one thesaurus entry comprises data for pointing to at least one memory region where coding data representing the data graph identifier list associated with said entry are stored.

165. A computer program product according to claim 164, wherein said memory region comprises a file individually allocated to said thesaurus entry.

166. A computer program product according to claim 164, wherein said memory region comprises a portion of a data container having a plurality of records each having a next address field, said portion being defined as a record chain in said data container, the chains being defined by means of the next address fields, and wherein said data for pointing to at least one memory region comprise a respective address of a first record of a chain in said data container.

167. A computer program product according to claim 166, further comprising instructions for grouping the records stored in the data container so that the records of each chain have contiguous addresses.

168. A computer program product to claim 162, wherein a plurality of formats are provided for representing the data graph identifier lists, and wherein each thesaurus entry contains an indication of the format used for representing the data graph identifier list associated therewith.

169. A computer program product according to claim 162, further comprising instructions for processing a query including:

instructions for analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;

instructions for determining a matching data graph identifier list based on said combination and on the stored data representing the data graph identifier lists associated with said relevant thesaurus entries;

instructions for outputting a response by processing said matching data graph identifier list.

170. A computer program product according to claim 169, wherein the instructions for analyzing the query criteria comprise:

instructions for selecting at least one range of words defined for one attribute in the query criteria; and

instructions for mapping the words of the selected range which are assigned to said attribute in the collection of data with one or more subsets, and for retaining the thesaurus entry for each of said one or more subset as a relevant entry for the selected range,

and wherein the instructions for determining the matching data graph identifier list comprises instructions for merging respective portions of the identifier lists represented by the data of the relevant thesaurus entries retained for said selected range.

171. A computer program product according to claim 170, wherein the instructions for mapping the words of the selected range are arranged to retain a minimum number of relevant thesaurus entries for each selected range.

172. A computer program product according to claim 170, wherein each thesaurus associated with an attribute is defined with reference to a partition such that each subset consists of one word or of consecutive words of the set of words which can be assigned to said attribute, the entries of said thesaurus being sorted based on the words assigned to said attribute, and wherein the instructions for analyzing the query criteria comprise instructions for performing a dichotomy search in at least one thesaurus for identifying relevant thesaurus entries.

173. A computer program product according to claim 172, wherein the thesauruses comprise word thesauruses each associated with a respective attribute of the group, with reference to a partition into subsets each consisting of one word.

174. A computer program product according to claim 173, wherein each word thesaurus associated with an attribute of the group to which the default value is assigned in at least one of the data graphs further has an entry for the default value, whereby one of said data graph identifier lists is associated with said thesaurus entry for the default value and includes any identifier allocated to a data graph having said default value assigned to said attribute.

175. A computer program product according to claim 172, wherein the thesauruses further comprise at least one macroword thesaurus associated with an attribute of said group and with a prefix length shorter than a word length of said attribute, whereby said macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length.

176. A computer program product according to claim 170, wherein the instructions for analyzing the query criteria comprise instructions for determining said combination involving relevant thesaurus entries as a tree having at least one leaf node, each leaf node corresponding to at least one relevant thesaurus entry retained for a respective attribute.

177. A computer program product according to claim 176, wherein said tree has a plurality of nodes including said at least one leaf node and at least one operator node, each operator node representing a Boolean operator applied to at least one partial criterion represented by another node of said tree, one of the operator nodes being a root node representing all the query criteria.

178. A computer program product according to claim 177, wherein the instructions for determining the matching data graph identifier list comprises instructions for obtaining a respective identifier list for each node of said tree, whereby the identifier list obtained for each leaf node corresponding to at least one relevant thesaurus entry is the merger of respective portions of the identifier lists associated with said at least one relevant thesaurus entry, and the identifier list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the identifier lists obtained for the node representing said at least one partial criterion, said matching data graph identifier list being determined as the identifier list obtained for the root node.

179. A computer program product according to claim 178, wherein each of said obtained identifier lists is produced in the form of a bitmap vector consisting of bits assigned to respective data graphs to indicate whether the identifiers allocated to said data graphs belong to said obtained list.

180. A computer program product according to claim 178, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a thesaurus entry, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

and wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

181. A computer program product according to claims 180, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

182. A computer program product according to claim 181, wherein said number of integers is a whole power of 2 for each layer.

183. A computer program product according to claim 180, wherein the instructions for determining the matching data graph identifier list comprise instructions for determining a layer n integer list for each node of said tree, whereby the layer n integer list determined for a leaf node consists of a layer n input list associated, in the coding scheme, with the merger of the identifier lists represented in the relevant thesaurus entries to which said leaf node corresponds, and whereby the layer n integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer n integer lists determined for the nodes representing said at least one partial criterion, and wherein a layer n result list is determined as the layer n integer list obtained for the root node.

184. A computer program product according to claim 183, wherein the nodes of said tree further include at least one preset node for which a data graph identifier list has been determined in advance, said data graph identifier list being subjected to the coding scheme to provide a layer n input list which is determined as said layer n integer list for said preset node.

185. A computer program product according to claim 183, wherein n>1 and the instructions for determining the matching data graph identifier list further comprise, for k decreasing from n-1 to 1, instructions for determining a layer k integer list for each node of said tree, whereby the layer k integer list determined for a leaf node consists of any integer of a layer k input list, associated in the coding scheme with the identifier list represented in a relevant thesaurus entry to which said leaf node corresponds, which belongs to a layer k subset whose position is represented in the layer k+1 result list, and whereby the layer k integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer k integer lists determined for the nodes representing said at least one partial criterion, and instructions for determining a layer k result list as the layer k integer list obtained for the root node, and wherein said matching data graph identifier list corresponds to the determined layer 1 result list.

186. A computer program product according to claim 169, wherein said instructions for outputting a response comprise instructions for selecting a thesaurus associated with an attribute specified in the query, and instructions for detecting entries of the selected thesaurus with which identifier lists having a non-empty intersection with the matching data graph identifier list are associated.

187. A computer program product according to claim 186, wherein said instructions for outputting a response comprise further comprise instructions for writing output data associated with any detected entry of a selected thesaurus into an output table.

188. A computer program product according to claim 187, wherein the output table includes a respective row corresponding to each identifier of the matching data graph identifier list, and wherein said instructions for writing output data are arranged to control the writing of output data associated with a detected entry of a selected thesaurus into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected thesaurus entry.

189. A computer program product according to claim 188, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, and wherein said instructions for writing output data are arranged to control the writing into a column of the output table of output data associated with a detected entry of a thesaurus selected for an attribute associated with said column.

190. A computer program product according to claim 187, wherein the selected thesaurus being a word thesaurus defined with reference to a partition into subsets each consisting of one word, the output data associated with a detected entry comprise the word for which said detected entry is provided.

191. A computer program product according to claim 187, wherein the selected thesaurus being a macroword thesaurus associated with a prefix length and defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the output data associated with a detected entry comprise the prefix value for which said detected entry is provided.

192. A computer program product according to claim 187, wherein the output data associated with a detected entry comprise an address of said detected entry in the selected thesaurus.

193. A computer program product according to claim 187, wherein the output data associated with a detected entry of a selected thesaurus comprise a numerical value derived from said thesaurus entry.

194. A computer program product according to claim 193, further comprising instructions for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a thesaurus value stored in said entry.

195. A computer program product according to claim 193, further comprising instructions for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in the output table.

196. A computer program product according to claim 193, wherein the output table includes a respective row corresponding to each identifier of the matching data graph identifier list, and wherein said instructions for writing output data are arranged to control the writing of a numerical value derived from a detected thesaurus entry into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said detected thesaurus entry.

197. A computer program product according to claim 196, wherein the numerical value, derived from a detected entry of a first selected thesaurus and written into any row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said entry of the first selected thesaurus, is obtained from a thesaurus value stored in said entry,

and wherein the numerical value, derived from a detected entry of at least one second selected thesaurus and written into a row of the output table corresponding to a data graph identifier belonging to both the matching data graph identifier list and the identifier list associated with said entry of the second selected thesaurus, is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in said row of the output table.

198. A computer program product according to claim 196, further comprising instructions for calculating an output value from a set of numerical values which have been respectively written into the rows of the output table.

199. A computer program product according to claim 169, wherein the instructions for storing data representing identifier lists are arranged to supervise a plurality of storage sections respectively associated with distinct portions of an integer range covering the identifiers allocated to the data graphs according to a predetermined partition, wherein each storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion,

and wherein the instructions for determining the matching data graph identifier list comprise instructions for determining respective matching identifier sub-lists included in the different portions of the data graph identifier range, by means of the respective storage sections.

200. A computer program product according to claim 199, wherein said matching identifier sub-lists are determined sequentially.

201. A computer program product according to claim 199, wherein said matching identifier sub-lists are determined in parallel.


Description

BACKGROUND OF THE INVENTION

The present invention relates to relational database management systems (RDBMS), and more particularly to computerized systems for storing and accessing large amounts of data.

In a non-limiting manner, the invention is applicable to "data warehouses". On-line transaction processing (OLTP) systems, such as for bank teller transactions and airline reservations, are optimized for finding a record associated with a specific key, e.g. finding the information about employee 123124. By contrast, data warehouses are optimized for finding sets of records very quickly. The reason is that typical queries are of the form: "find all sales by region and quarter" or "find stores that sell the greatest volume of sportswear per month" or "select the top 5 stores for each product category for the last year". Such queries must typically access large sets of rows in data tables. The query processing challenge is to process these queries without doing a linear scan of all or most of the database.

Five main approaches have been proposed to attack this problem: (i) multidimensional arrays; (ii) special indexes; (iii) table caching; (iv) optimized foreign key joins; and (v) approximation.

(i) Multidimensional Arrays (i.e. Matrices).

This strategy consists of implementing the data warehouse as a multidimensional array or matrix. Examples may be found in U.S. Pat. No. 5,359,724 and U.S. Pat. No. 5,864,857. Each dimension corresponds to an attribute of the data. For example, a sales table can be viewed as a matrix with coordinates: store location, product type, customer id, and so on. A particular sale can be identified by specifying all of these attributes. The strategy works well for small databases or very dense ones. By dense, we mean that the Cartesian product of possible values should all be meaningful, e.g., every customer is likely to buy every product from every store. Since this is rarely true, this scheme must be modified to deal with sparse values. This can be done by defining a notion of sparse attributes and dense ones. So, for example, it might be that every store carries every product (a dense relationship that can be stored in a matrix), but only some of these combinations are valid for any given customer. So, a conventional index would be used whenever customer sales are involved, but a dense one for queries involving store-wide or product-wide sales.

(ii) Special Indexes.

Bitmap indexes are an index structure tailored to data warehouses (see, e.g. U.S. Pat. No. 5,903,888). These indexes have already been used in some commercial products to speed up query processing. In its simplest form, a bitmap index on an attribute consists of one vector of bits (i.e. bitmap) per attribute value, where the size of each bitmap is equal to the number of records in the indexed relation. For example, if the attribute is day-of-week, then there would be seven bitmap vectors for that attribute, one for each day. The bitmap vector corresponding to Monday would have a 1 at position i if record i contains "Monday" in the day-of-week attribute. This single value-based approach is called a Value-List index. Other techniques (e.g. U.S. Pat. No. 5,761,652) associate bit vectors with ranges of values, so there could, for a salary attribute, be a vector for the range 0 to 20,000 Euros, 20,000.01 to 35,000 Euros, and so on. Still others associate each bit vector with a bit value (a 1 or a 0) in a given position. So, if the attribute holds n bit numbers, then there would be 2n bit vectors (position 1, bit value 1; position 1, bit value 0; position 2 bit value 1; . . . ).

The benefit of bit vectors is that it is easy to use multiple bit vectors to answer a single query. Consider a query on several predicates, each of which is indexed. Most conventional database management systems would use just one of the indexes (the one that is most "selective" so returns the fewest rows), though some systems might attempt to intersect the record identifiers of multiple indexes.

Bitmaps work better, because they are more compact and intersecting several bitmaps is much faster than intersecting several collections of record identifiers. In the best case, the improvement is proportional to the word size of the machine. For example, suppose the word size is 32 bits. Then two bit vectors can be intersected 32 bits at a time. Each set of 32 bits corresponds to 32 record identifiers being intersected. That best case occurs when each predicate is unselective (i.e. many records match each predicate value), but all the predicates together are quite selective. Consider for example the query: "Find people who have brown hair, glasses, ages between 30 and 40, blue eyes, work in the computer industry, live in California, . . . ".

So, matrices are best when sets of predicates are dense (all, or nearly all, values in the Cartesian product are possible), bitmaps are best when predicates are neither dense nor individually selective. An intermediate approach (when there is insufficient density for matrices but many values in the Cartesian product are present) is to use multidimensional indexes. Multidimensional indexes such as quadtrees, R-trees and their successors are implemented as variable sized grids on a multidimensional space. The grids are of variable sizes because the population of points differs in different places in a hyperspace. For intuition, consider a map of equi-population rectangles of France. The rectangles would be far more dense in Paris than in the alps. Indexes like this work well for spatial data (where they are used to find the points contained in latitude-longitude quadrants). This alternative is little explored in the commercial arena except for geographical queries, however, because these schemes do not scale well with increasing dimensionality and commercial systems typically have far more than three dimensions.

(iii) Table Caching.

If one doesn't have the luxury to design new indexes on top of a database system (because one is not the implementer of that system) one can pre-compute a large number of anticipated aggregate queries and put them in tables. For example, if a large retailer frequently asks queries that sum the total sales across multiple stores or multiple products, one may store such information in special tables. The main cost of such a strategy is maintaining these tables in the face of updates. (Disk space is no longer a major factor.) In the example, every sale of item I at store S would have to update the total product sales table for I and the total store sales table for S. So, this strategy is worthwhile if there are few updates between queries. The strategy is not worthwhile if there are many.

(iv) Optimized Foreign Key Joins.

Most queries in multidimensional tables entail joins between a central "fact table" (e.g. sales detail) and a set of dimension tables (e.g. store description, product description, customer description). These are known as "foreign key joins" since the customer identifier in the sales table, for example, is a key of the customer description table. (A key is a value belonging to an attribute such that only one record has that value in the attribute.) One way to accelerate these joins is to create a linkage between fact table records and dimension records. This can be done in three basic ways

(a) create an index that holds fact table record identifiers and dimension table record identifiers;

(b) create bidirectional pointers between fact table records and dimension table rows--this is what "object-oriented" databases do;

(c) replace the customer record identifiers in the fact table by offsets into the dimension tables.

Choice (a) is the most independent of changes in the physical organization of the tables and therefore is best for heavily updated systems, because changes to the dimension table can be reflected in the index to that table alone. Choice (b) is the least flexible to physical reorganization, because reorganizing a dimension table would entail updating the fact table. Choice (c) is a compromise of the two in that certain physical reorganizations can be done to the dimension tables (e.g. changing its position on disk) without changing the fact table. Examples of join optimization may be found in U.S. Pat. No. 5,548,754, U.S. Pat. No. 5,671,403, U.S. Pat. No. 5,724,568, U.S. Pat. No. 5,752,017, U.S. Pat. No. 5,761,657 and U.S. Pat. No. 5,822,747.

(v) Approximating the Result

Since most people use data warehouses to get strategic aggregate information, many would be happy with a fast approximation as long as it has error bounds. Typical work in this area is illustrated by U.S. Pat. No. 5,870,752, which shows how to estimate aggregate results in data warehouses while giving error bounds. The basic problem is that sampling all tables and then doing aggregates does not work in general. For example, if one wants to join R and S on their keys, then taking a 1/10 sample of each will give a size that is 1/100 of the size of the real join if the samples are random. So, one must be more clever. The idea is to take an initial set of tables R, S, T, . . . that are linked by foreign key joins. Suppose for example that R is the fact table and the others are dimension tables. Take a sample of R and then perform all these foreign key joins based on the sample giving R', S', T', . . . . Now, if a query involves R, S, T and includes the foreign key links among these, then the query can be done with great accuracy on R', S', T'. The error can be estimated by considering the result obtained by several partitions of R' and looking at their variance.

An object of the present invention is to propose an alternative method of organizing a database management system, which enables an efficient query processing.

SUMMARY OF THE INVENTION

The invention proposes a method of organizing information in a database system, wherein a group of attributes is defined and words of a collection of data are assigned to said attributes. The group of attributes is divided into a plurality of sub-groups each associated with a respective data table, each data table having a column for each attribute of the associated sub-group and rows for containing data table records comprising at least one word assigned to an attribute of the associated sub-group. Links are defined between the data tables records, each link having a target table and a corresponding source table having a link column containing link values each designating a record of said target table. Each of said link values represents a link between the record of the source table including said link value and the record of the target table designated by said link value. The method comprises the steps of:

allocating respective identifiers to data graphs, wherein each data graph represents related attribute values respectively assigned to the attributes of said group, wherein each attribute value of a data graph is either a default value or a word of said collection of data, and wherein the words of each data graph are from linked data table records;

storing a plurality of word thesauruses respectively associated with attributes of said group, wherein for each word assigned at least once to an attribute in the collection of data, the word thesaurus associated with said attribute has a respective entry containing said word; and

storing data representing data graph identifier lists respectively associated with the word thesaurus entries, wherein the data graph identifier list associated with a thesaurus entry relating to a word assigned to an attribute includes any identifier allocated to a data graph having said word assigned to said attribute.

The invention further proposes a method of processing queries in a database system, wherein a group of attributes is defined and words of a collection of data are assigned to said attributes, the group of attributes being divided into a plurality of sub-groups respectively associated with a plurality of data tables having independent numbers of records, with links between respective records from the data tables. Identifiers are respectively allocated to data graphs, each data graph representing related attribute values respectively assigned to the attributes of said group, each attribute value of a data graph being either a default value or a word of said collection of data. A plurality of thesauruses each associated with a respective attribute of said group and data representing data graph identifier lists respectively associated with entries of said thesauruses are stored. Each thesaurus associated with one attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said one attribute and has a respective entry for each subset including at least one word assigned to said one attribute in the collection of data, the data graph identifier list associated with said thesaurus entry including any identifier allocated to a data graph having a word of said subset assigned to said one attribute. The method comprises the steps of:

analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;

determining a matching data graph identifier list based on said combination and on the stored data representing the data graph identifier lists associated with said relevant thesaurus entries;

processing said matching data graph identifier list to output a response.

The invention further proposes a database system implementing a method as outlined above, and computer program products having instructions for carrying out such method.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1-3 show an example of data structure as typically used in a conventional relational database system.

FIG. 4 is a diagram representing a data table tree in the example of FIGS. 1-3.

FIGS. 5-7 are diagrams showing respective data graphs constructed with the tree of FIG. 4 and the data of FIGS. 1-3.

FIG. 8 is a flat file representation of the data tables of FIGS. 1-3.

FIG. 9 shows a link table as used in an embodiment of the invention.

FIGS. 10A-H show the contents of thesauruses corresponding to the data tables of FIGS. 1-3.

FIGS. 11A-14A, 11G-14G and 11H-14H show other representations of the thesauruses of FIGS. 10A, 10G and 10H, respectively.

FIGS. 15-16 illustrate the data stored in a data container in connection with the thesauruses of FIGS. 14A, 14G and 14H.

FIG. 17 shows another possible structure of the thesaurus of FIGS. 10A-14A.

FIG. 18 is a block diagram of a computer system suitable for implementing the invention.

FIG. 19 is a flow chart showing a data graph creation procedure in accordance with an embodiment the invention.

FIG. 20 is a flow chart showing a procedure applicable in stage 124 of FIG. 19.

FIGS. 21 and 22 are flow charts showing procedures applicable in step 136 of FIG. 20.

FIGS. 23 and 24 are flow charts showing another procedure applicable in step 136 of FIG. 20 in two successive coding layers.

FIGS. 25-32 are tables showing a way of storing thesauruses constructed from the example of FIGS. 1-3.

FIG. 33 is a flow chart showing an alternative way of executing steps 135 and 136 of FIG. 20 when the thesauruses are stored as shown in FIG. 17.

FIGS. 34A and 34B are tables showing an alternative embodiment of the tables of FIGS. 31-32.

FIGS. 34C and 34D are another representation of the tables of FIGS. 34A and 34B.

FIG. 35 is a flow chart showing a procedure applicable in the management of tables of the type shown in FIGS. 34A and 34B.

FIG. 36 is a general flow chart of a query processing procedure in accordance with an embodiment of the invention.

FIG. 37 is a diagram showing an example of query tree referring to the example of FIGS. 1-3.

FIG. 38 is another diagram showing an expanded query tree obtained by analyzing the query tree of FIG. 37.

FIG. 39 is a flow chart showing a procedure of analyzing the query tree.

FIG. 40, which is obtained by placing FIG. 40A above FIG. 40B, is a flow chart of a recursive function referred in the procedure of FIG. 39.

FIG. 41 is the flow chart procedure for identifying matching data graphs based on an expanded query tree as illustrated in FIG. 38.

FIG. 42 is a flow chart of a recursive function FNODE called to in the procedure of FIG. 41.

FIGS. 43-45 are flow charts illustrating procedures executed in steps 262, 264 and 265 of FIG. 42, respectively.

FIG. 46 is a flow chart showing an alternative embodiment of the procedure of step 265 of FIG. 42.

FIG. 47 is a flow chart showing another alternative embodiment of the procedure of step 265 of FIG. 42, when the thesauruses are stored as illustrated in FIGS. 34A and 34B.

FIG. 48 is a flow chart of a recursive function FILT called in the procedure of FIG. 47.

FIG. 49 is a flow chart showing another alternative embodiment of the procedure of step 265 of FIG. 42, when the thesauruses are stored as illustrated in FIG. 17.

FIG. 50 is a flow chart of a variant of a leaf processing used in the function of FIG. 42.

FIG. 51 is a flow chart showing a procedure applicable for scanning the thesaurus relating to a given attribute in order to retrieve the attribute values relevant to a database query.

FIG. 52 is a flow chart of a function FINTER referred to in the procedure of FIG. 51.

FIGS. 53-55 are flow charts showing procedures executed in steps 355, 357 and 358 of FIG. 52, respectively.

FIG. 56 is a flow chart showing an alternative procedure applicable in step 358 of FIG. 52, when the thesauruses are stored as illustrated in FIGS. 33-34.

FIG. 57 is a flow chart of a recursive function FFILT called in the procedure of FIG. 56.

FIGS. 58-61 show tables which may be stored to cooperate with the tables of FIGS. 25-34.

FIG. 62 is a flow chart showing a pre-filtering procedure which may be used prior to a thesaurus scanning similar to that of FIG. 51.

FIG. 63 is a flow chart showing a part of a thesaurus scanning procedure according to FIG. 51, adapted to take into account a pre-filtering according to FIG. 62.

FIG. 64 is a flow chart showing an alternative procedure applicable in step 358 of FIG. 52, when the thesauruses are stored as illustrated in FIG. 17.

FIG. 65 is a flow chart showing a procedure applicable in step 335 of FIG. 51.

FIGS. 66 and 67 show the contents of an exemplary output table used to provide a query response.

FIG. 68 is a diagram illustrating another possible structure of the output table.

FIGS. 69 and 70 are flow charts showing procedures applicable in step 335 of FIG. 51 to construct an output table of the type shown in FIG. 68.

FIG. 71 is a flow chart showing a procedure applicable in step 335 of FIG. 51 to perform computations in a database system by means of a computation table.

FIG. 72 is a block diagram of another computer system suitable for implementing the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

VIRTUAL DATA GRAPHS

FIGS. 1-3 illustrate a collection of data which can be stored in a computer memory coupled with a processor arranged for running relational database management programs. This example will be referred to in the following description to give an illustration of the principles and embodiments of the invention where appropriate.

FIGS. 1-3 show a conventional type of data organization in a database system. The illustrated system handles data relevant to a hypothetical insurance company which manages policies for its clients. The data are organized in three tables relating to the clients, policies and accidents as shown in FIGS. 1-3, respectively.

From a logical point of view, each data table consists of a two-dimensional matrix, with rows corresponding to respective records in the table and columns corresponding to respective data attributes of the records or structural features of the database (the latter type of column typically contains either local record identification keys or foreign keys designating records in a target table).

It will be appreciated, however, that for large databases the actual storage of the data in a memory medium, e.g. a magnetic disc, is frequently performed otherwise: each row typically has a memory address where the corresponding attribute values or keys are stored in the order of the columns and separated by predetermined symbols such as the encoded character ".backslash.".

In our simplified example given to facilitate the explanation of the proposed data structures, the tables are of modest size. In practice, there are usually more tables and more attributes (columns) per table (notwithstanding, one ore more tables could also have a single column). Moreover, the data tables generally include much more records, up to thousands or millions of rows depending on the application.

In that example, the database a group of seven attributes distributed into three sub-groups corresponding to the three data tables. Each attribute has a column in the data table corresponding to its sub-group. The client data table (FIG. 1) has three attributes, i.e. client name, birth year and gender. The policy data table of FIG. 2 has two attributes, i.e. policy type ("car" or "house") and policy effect date, and a link column to the client table. The accident data table of FIG. 3 has two attributes, i.e. date of accident and amount of damages incurred in a given currency, and a link column to the policy table.

In a given data table, each record/row has a unique identifier, referred to as a row-ID. This identifier corresponds to the memory address where the record is stored, usually through a conversion table. It may be stored as an identification key in a column of the data table for the purposes of unique row identification, but this is not compulsory. In our example, the row-ID's are integer indexes starting from zero for each data table, and they are not stored explicitly in a column of the table.

Some of the tables are linked together, as indicated in the last column of FIGS. 2 and 3. Two tables are directly linked if one of them (source table) has a link column provided for containing foreign keys designating records of the other one (target table).

Those foreign keys, hereafter called links, reflect the hierarchy and organization of the data handled in the relational database system. In our example, each accident dealt with by the insurance company is related to a certain policy managed by the company, hence the policy links of FIG. 3. Each policy is for a particular client of the company, hence the client links of FIG. 2. It will be noted that some links may be optional. For example, some accidents may involve third parties and if there is a separate table for third parties, then each record of the accident table may have a link to the third party table.

Each link typically consists of a row-ID in the target data table. For instance, the accident stored as row-ID=0 in the accident table of FIG. 3, which took place on Oct. 3, 1998 for an amount of 1,000 has a policy link pointing to the policy stored as row-ID=1 in the policy table of FIG. 2, i.e. it relates to a car policy subscribed on Sep. 9, 1998 by the client with row-ID=1 in the client table of FIG. 1, i.e. Andre, a man born in 1976. If the target table has other forms of record identification keys, for example compound keys, a link may also