Generating database or data structure (e.g., via user interface)

Methods of organizing data and processing queries in a database system, and database system and software product for implementing such methods

6633883

Abstract

A reference table, which may not be stored, has columns associated with data attributes and rows containing related words assigned to those attributes in a collection of data. The stored data include at least one macroword thesaurus associated with an attribute and with a prefix length shorter than a word length of said attribute, and reference table row identifier lists respectively associated with thesaurus entries. Each macroword thesaurus associated with an attribute and with a prefix length has a respective entry for each prefix value having this prefix length and matching a corresponding prefix of at least one word assigned to this data attribute in the collection of data.


Claims

What is claimed is:

1. A method of organizing information in a database system, wherein a plurality of row identifiers are defined to designate respective rows of a reference table having colunms respectively associated with data attributes, said rows containing groups of related words assigned to said attributes in a collection of data, the method comprising the steps of:

storing at least one macroword thesaurus associated with one of the attributes and with a prefix length shorter than a word length of said attribute, said macroword thesaurus having a respective entry for each prefix value having said prefix length and matching at least a beginning portion of at least one word assigned to said data attribute in the collection of data; and

storing data representing identifier lists respectively associated with the macroword thesaurus entries, wherein the identifier list associated with an entry, relating to a prefix value, of a macroword thesaurus associated with an attribute includes any row identifier designating a row of the reference table having a word beginning with at least a portion matched by said prefix value in the column associated with said attribute.

2. A method according to claim 1, wherein the entries of each macroword thesaurus associated with an attribute are sorted based on the prefix values.

3. A method according to claim 1, wherein a plurality of macroword thesauruses associated with different prefix lengths are stored for at least one attribute.

4. A method according to claim 1, further comprising the step of storing a word thesaurus associated with said one of the attributes, said word thesaurus having a respective entry for each word assigned at least once to said attribute in the collection of data, said entry containing data representing an identifier list including each row identifier designating a row of the reference table having said word in the column associated with said attribute.

5. A method according to claim 4, wherein the word thesaurus associated with an attribute for which the reference table has a default value in at least one row further has an entry for the default value, containing data representing an identifier list including each row identifier designating a row of the reference table having said default value in the column associated with said attribute.

6. A method according to claim 4, wherein the entries of the word thesaurus are sorted based on the words assigned to said attribute.

7. A method according to claim 6, wherein the entries of each macroword thesaurus associated with an attribute are sorted based on the prefix values, wherein at least one attribute has a number Q of stored macroword thesauruses associated with different prefix lengths, each having a thesaurus level parameter q such that 1.ltoreq.Q, Q being an integer at least equal to 1, the prefix length being a decreasing function of the level parameter if Q>1, wherein the level 1 macroword thesaurus further contains, in each entry provided for a level 1 prefix value, data designating the entry of the word thesaurus associated with said attribute which corresponds to the lowest or highest word beginning with at least a portion matched by said level 1 prefix value, and wherein any macroword thesaurus having a level parameter q>1 further contains, in each entry provided for a level q prefix value, data designating the entry of the level q-1 macroword thesaurus which corresponds to the lowest or highest level q-1 macroword beginning with at least a portion matched by said level q prefix value.

8. A method according to claim 1, wherein said reference table is a virtual table which is not stored.

9. A method according to claim 8, further comprising the step of storing a link table having a plurality of rows respectively associated with the rows of the reference table and a plurality of columns respectively associated with attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the associated reference table row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or a link value for retrieving at least one stored word of the collection of data represented in the associated reference table row and assigned to an attribute of the sub-group associated with said one of the columns.

10. A method according to claim 9, wherein a respective data table is stored for each of the attribute sub-groups, and wherein each link value contained in a column of the link table associated with an attribute sub-group comprises data for identifying a row of the data table stored for said sub-group.

11. A method according to claim 1, wherein said data representing identifier lists comprise, for at least one thesaurus entry, coding data obtained by a coding scheme having n successive coding layers, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer, and wherein said coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

12. A method according to claim 11, wherein the coding data are stored in a plurality of files including files respectively allocated to thesaurus entries.

13. A method according to claim 11, wherein the coding data are stored in a plurality of files including at least one file allocated to a respective thesaurus, for containing the coding data relating to the entries of said thesaurus.

14. A method according to claim 11, wherein the coding data are stored in at least one file allocated to a plurality of thesauruses, for containing the coding data relating to the entries of said plurality of thesauruses.

15. A method according to claim 11, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

16. A method according to claim 15, wherein said number of integers is a whole power of 2 for each layer.

17. A method according to claim 11, wherein n.gtoreq.2 and layer k data containers each having a plurality of records are provided in a computer memory for 1.ltoreq.k.ltoreq.n, each record of a layer k data container being associated with a layer k integer rank representing the position of a subset in the layer k pattern, and wherein each record of a layer k data container associated with a layer k rank representing the position of a subset in the layer k pattern has a first field for containing data for retrieving the position within said subset of any integer of a layer k input list relating to an identifier list, whereby a combination of said layer k rank with any position retrievable from the data contained in said first field determines a layer k-1 rank with which a respective record of the layer k-1 data container is associated if k>1, and an identifier of said identifier list if k=1.

18. A method according to claim 17, wherein, for 1.ltoreq.k.ltoreq.n, said data contained in the first field of a record of the layer k data container for retrieving the position of any integer of a layer k input list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer belongs to said layer k input list.

19. A method according to claim 18, wherein, for 1.ltoreq.k.ltoreq.n, each record of the layer k data container associated with a layer k rank further has a second field for containing said layer k rank.

20. A method according to claim 19, wherein each data container comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.

21. A method according to claim 17, wherein, for 1.ltoreq.k.ltoreq.n, each record of the layer k data container further has a second field for containing a number representing the position of an integer of a layer k+1 input list within a subset of the layer k+1 pattern,

and wherein, for 1<k.ltoreq.n, said data contained in the first field of a record of the layer k data container associated with a layer k rank for retrieving the position of any integer of a layer k input list within a subset of the layer k pattern comprise a pointer to at least one record of the layer k-1 data container in which the second field contains a number representing the position of an integer of said layer k input list within said subset of the layer k pattern, whereby said record of the layer k-1 data container is associated with the layer k-1 rank determined by the combination of said layer k rank with the position represented by said number.

22. A method according to claim 21, wherein said data contained in the first field of a record of the layer 1 data container for retrieving the position of any integer of an identifier list within a subset comprise a bitmap segment in which each bit is associated with a respective integer of said subset to indicate whether said integer represents an identifier of said list.

23. A method according to claim 21, wherein each record of the layer n data container associated with a layer n rank further has a second field for containing said layer n rank.

24. A method according to claim 21, wherein each layer k data container for 1.ltoreq.k<n comprises at least two files where the first and second fields of the records of said data container are respectively stored, said files being accessible separately.

25. A method according to claim 17, wherein, for 1.ltoreq.k.ltoreq.n, each record of the layer k data container further has a next address field, whereby record chains are defined in the layer k data container by means of the next address fields, and wherein at least some of the thesaurus entries are respectively associated with record chains in the layer n data container, whereby the coding data relating to one of said entries for layer n are stored in or retrievable from the record chain associated therewith in the layer n data container.

26. A method according to claim 25, wherein, for 1.ltoreq.k<n, said thesaurus entries are respectively associated with record chains in the layer k data container, whereby the coding data relating to one of said entries for layer k are stored in or retrievable from the record chain associated therewith in the layer k data container.

27. A method according to claim 25, wherein, for 1<k.ltoreq.n, each record of the layer k data container further has a head address field for pointing to an address of a first record of a respective chain in the layer k-1 data container.

28. A method according to claim 25, wherein each layer k data container for 1.ltoreq.k.ltoreq.n comprises at least two files where the first fields and the next address fields of the records of said data container are respectively stored, said files being accessible separately.

29. A method according to claim 1, wherein an integer range covering the identifiers designating the rows of the reference table is partitioned into a plurality of predetermined portions, wherein at least some of the data representing identifier lists are distributed into a plurality of storage sections respectively associated with said portions, wherein a storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion.

30. A method according to claim 29, wherein a respective storage unit is provided for each of said portions of the reference table row identifier range, to receive the storage sections associated with said portion.

31. A method according to claim 30, wherein at least some of the macroword thesauruses have a plurality of sections respectively associated with said portions, wherein a section, associated with one of said portions, of a macroword thesaurus associated with an attribute and with a prefix length has a respective entry for each prefix value having said prefix length and matching at least a beginning portion of at least one word assigned to said data attribute in a reference table row to which an identifier of said portion is allocated, said entry containing data for retrieving an identifier sub-list from the storage section associated with said portion.

32. A method according to claim 31, further comprising the step of storing a word thesaurus associated with said one of the attributes, said word thesaurus having a respective entry for each word assigned at least once to said attribute in the collection of data, said entry containing data representing an identifier list including each row identifier designating a row of the reference table having said word in the column associated with said attribute, wherein at least one word thesaurus has a plurality of sections respectively associated with said portions, wherein a section, associated with one of said portions, of a word thesaurus associated with an attribute has a respective entry for each word assigned to said attribute in a reference table row to which an identifier of said portion is allocated, said entry containing data for retrieving an identifier sub-list from the storage section associated with said portion.

33. A method according to claim 29, wherein each thesaurus entry has a plurality of fields respectively associated with said portions, for containing data for retrieving respective identifier sub-lists from the storage sections.

34. A method of processing a query in a database system, wherein a plurality of row identifiers are defined to designate respective rows of a reference table having columns respectively associated with data attributes, said rows containing groups of related words assigned to said attributes in a collection of data, wherein a plurality of thesauruses each associated with a respective attribute and data representing reference table row identifier lists respectively associated with entries of said thesauruses are stored, wherein each thesaurus associated with one attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said one attribute and has a respective entry for each subset including at least one word assigned to said one attribute in the collection of data, the reference table row identifier list associated with said thesaurus entry including any identifier allocated to a row of the reference table having a word of said subset assigned to said one attribute,

wherein the thesaurus include at least one macroword thesaurus associated with an attribute and with a prefix length shorter than a word length of said attribute, whereby said macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the method comprising the steps of:

analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;

determining a matching reference table row identifier list based on said combination and on the stored data representing the reference table row identifier lists associated with said relevant thesaurus entries; and

processing said matching row identifier list to output a response.

35. A method according to claim 34, wherein at least one attribute has a plurality of macroword thesauruses, associated with different prefix lengths.

36. A method according to claim 34, wherein the step of analyzing the query criteria comprises, for at least one attribute referred to in said criteria:

selecting at least one range of words defined for said attribute in the query criteria; and

mapping the words of the selected range which are assigned to said attribute in the collection of data with one or more subsets, the thesaurus entry for each of said one or more subset being retained as a relevant entry for the selected range,

and wherein the step of determining the matching row identifier list comprises merging respective portions of the identifier lists represented by the data of the relevant thesaurus entries retained for said selected range.

37. A method according to claim 36, wherein the mapping is performed so as to retain a minimum number of relevant thesaurus entries for each selected range.

38. A method according to claim 36, wherein each thesaurus associated with an attribute is defined with reference to a partition such that each subset consists of one word or of consecutive words of the set of words which can be assigned to said attribute, the entries of said thesaurus being sorted based on the words assigned to said attribute, and wherein the step of analyzing the query criteria comprises at least one dichotomy search in at least one thesaurus for identifying relevant thesaurus entries.

39. A method according to claim 38, wherein the thesauruses further comprise at least one word thesaurus associated with a respective attribute, with reference to a partition into subsets each consisting of one word.

40. A method according to claim 39, wherein each word thesaurus associated with an attribute to which the default value is assigned in at least one of the reference table rows further has an entry for the default value, whereby one of said reference table row identifier lists is associated with said thesaurus entry for the default value and includes any identifier allocated to a reference table row having said default value assigned to said attribute.

41. A method according to claim 36, wherein the step of analyzing the query criteria comprises determining said combination involving relevant thesaurus entries as a tree having at least one leaf node, each leaf node corresponding to at least one relevant thesaurus entry retained for a respective attribute.

42. A method according to claim 41, wherein said tree has a plurality of nodes including said at least one leaf node and at least one operator node, each operator node representing a Boolean operator applied to at least one partial criterion represented by another node of said tree, one of the operator nodes being a root node representing all the query criteria.

43. A method according to claim 42, wherein the nodes of said tree further include at least one preset node for which a reference table row identifier list has been determined prior to said step of analyzing the query criteria.

44. A method according to claim 43, wherein the reference table row identifier list of said preset node is determined from at least one matching reference table row identifier list obtained when processing a previous query.

45. A method according to claim 42, wherein the step of determining the matching row identifier list comprises obtaining a respective identifier list for each node of said tree, whereby the identifier list obtained for each leaf node corresponding to at least one relevant thesaurus entry is the merger of respective portions of the identifier lists associated with said at least one relevant thesaurus entry, and the identifier list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the identifier lists obtained for the node representing said at least one partial criterion, said matching row identifier list being determined as the identifier list obtained for the root node.

46. A method according to claim 45, wherein each of said obtained identifier lists is produced in the form of a bitmap vector consisting of bits assigned to respective reference table rows to indicate whether the identifiers allocated to said reference table rows belong to said obtained list.

47. A method according to claim 45, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a thesaurus entry, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

and wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

48. A method according to claim 47, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

49. A method according to claim 48, wherein said number of integers is a whole power of 2 for each layer.

50. A method according to claim 47, wherein said data representing the position of an integer of an input list within a subset consist of a bitmap segment.

51. A method according to claim 47, wherein the step of determining the matching row identifier list comprises determining a layer n integer list for each node of said tree, whereby the layer n integer list determined for a leaf node consists of a layer n input list associated, in the coding scheme, with the merger of the identifier lists represented in the relevant thesaurus entries to which said leaf node corresponds, and whereby the layer n integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer n integer lists determined for the nodes representing said at least one partial criterion, and wherein a layer n result list is determined as the layer n integer list obtained for the root node.

52. A method according to claim 51, wherein the nodes of said tree further include at least one preset node for which a reference table row identifier list has been determined prior to said step of analyzing the query criteria, said reference table row identifier list being subjected to the coding scheme to provide a layer n input list which is determined as said layer n integer list for said preset node.

53. A method according to claim 51, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for the coding layer n define a layer n bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the data representing the position of said subset in the layer n pattern comprise a layer n integer rank associated with said layer n bitmap segment, and wherein the step of determining a layer n integer list for a leaf node comprises:

initializing a layer n bitmap vector with logical zeroes;

obtaining the layer n ranks and associated bitmap segments from the coding data for each relevant thesaurus entry to which said leaf node corresponds; and

for each of said layer n ranks, superimposing the layer n bitmap segment associated therewith onto a segment of said layer n bitmap vector having a position determined by said layer n rank, the superimposition being performed according to a bitwise Boolean OR operation,

said layer n list for the leaf node corresponding to the resulting layer n bitmap vector.

54. A method according to claim 51, wherein n>1 and the step of determining the matching row identifier list further comprises, for k decreasing from n-1 to 1, determining a layer k integer list for each node of said tree, whereby the layer k integer list determined for a leaf node consists of any integer of a layer k input list, associated in the coding scheme with the identifier list represented in a relevant thesaurus entry to which said leaf node corresponds, which belongs to a layer k subset whose position is represented in the layer k+1 result list, and whereby the layer k integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer k integer lists determined for the nodes representing said at least one partial criterion, wherein a layer k result list is determined as the layer k integer list obtained for the root node, and wherein said matching row identifier list corresponds to the determined layer 1 result list.

55. A method according to claim 54, wherein the nodes of said tree further include at least one preset node for which a reference table row identifier list has been determined prior to said step of analyzing the query criteria, said reference table row identifier list being subjected to the coding scheme to provide a layer k input list which is determined as said layer k integer list for said preset node.

56. A method according to claim 54, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the coding data further comprise a layer k integer rank associated with said layer k bitmap segment to represent the position of said subset in the layer k pattern, and wherein the step of determining a layer k integer list for a leaf node comprises:

initializing a layer k bitmap vector with logical zeroes;

obtaining the layer k ranks from the coding data for each relevant thesaurus entry to which said leaf node corresponds; and

selecting any obtained layer k rank belonging to the layer k+1 result list and superimposing the associated layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by the selected layer k rank, the superimposition being performed according to a bitwise Boolean OR operation,

said layer k list for the leaf node corresponding to the resulting layer k bitmap vector.

57. A method according to claim 56, wherein, for 1.ltoreq.k<n, the layer k ranks and the layer k bitmap segments associated therewith for at least one thesaurus entry are stored at corresponding addresses in distinct first and second files, and wherein the step of determining a layer k integer list for a leaf node comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said first and second files;

filling the rank table by writing any selected layer k rank into the rank table record associated with the address of the selected layer k rank in said first file; and

for any record of the filled rank table containing a layer k rank and associated with an address in the second file, reading the associated layer k bitmap segment at said address in the second file and superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.

58. A method according to claim 54, wherein the step of determining the matching row identifier list further comprises, for any coding layer k such that 1<k.ltoreq.n, determining a layer k' filtering list for k.ltoreq.k'.ltoreq.n consisting of the layer k' input list obtained by providing the layer k result list as an input list in layer k of the coding scheme,

wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k<n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while a layer k integer rank associated with said layer k bitmap segment represents the position of said subset in the layer k pattern, and wherein the step of determining a layer k integer list for a leaf node for k<n comprises:

/a/ initializing a layer k bitmap vector with logical zeroes;

/b/ selecting the layer n ranks obtained from the coding data for each relevant thesaurus entry to which said leaf node corresponds, and setting k'=n;

/c/ for each selected layer k' rank:

/c1/ if the selected layer k' rank represents the position in the layer k' pattern of a subset which includes at least one integer of the layer k' filtering list, obtaining the layer k' bitmap segment with which the selected layer k' rank is associated;

/c2/ for any integer of the layer k' filtering list whose position within said subset is represented in said layer k' bitmap segment, selecting a respective layer k'-1 rank determined from the selected layer k' rank and said position represented in said layer k' bitmap segment;

/c3/ if k'>k+1, executing step /c/ with k' decremented by one unit; and

/c4/ if k'-1=k, obtaining any layer k bitmap segment with which a selected layer k'-1 rank is associated, and superimposing said layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said selected layer k'-1 rank, the superimposition being performed according to a bitwise Boolean OR operation,

said layer k list for the leaf node corresponding to the resulting layer k bitmap vector.

59. A method according to claim 58, wherein, for 1.ltoreq.k<n, the layer k bitmap segments for at least one thesaurus entry are stored in at least one layer k file at addresses respectively corresponding to the layer k ranks associated therewith, and wherein, for 1.ltoreq.k<n, the step of determining a layer k integer list for a leaf node comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said layer k file;

filling the rank table by writing any selected layer k rank into the rank table record associated with the address corresponding to the selected layer k rank; and

for any record of the filled rank table containing a layer k rank and associated with an address in said layer k file, reading the associated layer k bitmap segment at said address and superimposing the read layer k bitmap segment onto a segment of said layer k bitmap vector having a position determined by said layer k rank.

60. A method according to claim 34, wherein the step of processing the matching row identifier list comprises, for at least one attribute specified in the query, selecting a thesaurus associated with said attribute and detecting entries of the selected thesaurus with which identifier lists having a non-empty intersection with the matching row identifier list are associated.

61. A method according to claim 60, wherein said attribute specified in the query has Q+1 stored thesauruses associated with different prefix lengths, Q being an integer at least equal to 0, each of said Q+1 thesauruses having a thesaurus level parameter q such that 0.ltoreq.q.ltoreq.Q, whereby the prefix length is a decreasing function of the level parameter q and corresponds to a word length of said attribute for q=0, wherein each of said Q+1 thesauruses is defined with reference to a respective partition into subsets each consisting of words beginning by a common prefix having the prefix length associated with said thesaurus, the entries of said thesaurus being sorted based on the prefix values.

62. A method according to claim 61, wherein, the selected thesaurus having a level parameter QA.gtoreq.0, the detection of entries in the selected thesaurus comprises the steps of:

/a/ providing respective level q target lists and respective level q thesaurus ranges covering consecutive entries of the level q thesaurus for QA.ltoreq.q.ltoreq.Q;

/b/ initializing the level Q target list with the matching row identifier list, initializing the level parameter q with the value Q, and selecting a first entry of the level Q thesaurus range;

/c/ determining an intersection list between the level q target list and the identifier list associated with the selected entry of the level q thesaurus range;

/d/ if the intersection list determined in the preceding step /c/ is empty, selecting another entry of the level q thesaurus range and repeating step /c/;

/e/ if q is greater than QA:

/e1/ setting the level q-1 target list as equal to the intersection list determined in the preceding step /c/;

/e2/ setting the level q-1 thesaurus range as consisting of the entries of the level q-1 thesaurus relating to level q-1 prefixes which begin with the level q prefix of the selected level q thesaurus entry, and selecting a first entry of the level q-1 thesaurus range;

/e3/ decrementing q by one unit and returning to step /c/;

/f/ if q is equal to QA:

/f1/ including the selected level QA thesaurus entry in the detected entries;

/f2/ if the level Q target list is equal to the intersection list determined in the preceding step /c/, terminating the detection of entries in the selected thesaurus;

/f3/ removing the integers of the intersection list determined in the preceding step /c/ from any target list including at least one integer which is not in said intersection list;

/f4/ setting q as the smallest level parameter for which the target list includes at least one integer which is not in said intersection list;

/f5/ selecting another entry in the level q thesaurus range and returning to step /c/.

63. A method according to claim 62, wherein Q.gtoreq.1 and each thesaurus having a level parameter q.gtoreq.1 further contains, in each entry provided for a level q prefix value, data designating the entry of the level q-1 thesaurus which corresponds to the lowest or highest level q-1 prefix beginning with the level q prefix of said level q thesaurus entry, and wherein step /e2/ comprises selecting the level q-1 thesaurus entry designated in the selected level q thesaurus entry.

64. A method according to claim 62, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a level q thesaurus entry for 0.ltoreq.q.ltoreq.Q, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer,

and wherein each level q target list forms a layer 1 and level q filtering list and is submitted as a layer 1 input list in the coding scheme for QA.ltoreq.q.ltoreq.Q to provide respective layer k and level q filtering lists for 1<k.ltoreq.n if n>1, said layer k and level q filtering list provided from a level q target list being the layer k input list obtained from said level q target list in the coding scheme.

65. A method according to claim 64, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

66. A method according to claim 65, wherein said number of integers is a whole power of 2 for each layer.

67. A method according to claim 64, wherein the step /c/ of determining the intersection list between a level q target list and an identifier list comprises, from k=n:

/c1/ computing a layer k intersection list between the layer k input list obtained from said identifier list in the coding scheme and the layer k and level q filtering list corresponding to said level q target list;

/c2/ if the computed layer k intersection list is empty, determining said intersection list between the level q target list and the identifier list as being empty;

/c3/ if k=1, determining said intersection list between the level q target list and the identifier list as the computed layer 1 intersection list; and

/c4/ if k>1, decrementing k by one unit and repeating from step /c1/.

68. A method according to claim 67, wherein, in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k.ltoreq.n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while the data representing the position of said subset in the layer k pattern comprise a layer k integer rank associated with said layer k bitmap segment, and wherein the step /c1/ of computing a layer k intersection list between a layer k input list obtained from an identifier list in the coding scheme and a layer k and level q filtering list, represented by a first layer k bitmap vector, comprises:

initializing a second layer k bitmap vector with logical zeroes;

obtaining layer k ranks from the coding data representing said identifier list; and

selecting any obtained layer k rank which represents the position in the layer k pattern of a subset including at least one integer of said layer k and level q filtering list, obtaining the layer k bitmap segment with which the selected layer k rank is associated, and determining a segment of the second layer k bitmap vector having a position determined by the selected layer k rank by combining the obtained layer k bitmap segment with a segment of the first layer k bitmap vector having a position determined by the selected layer k rank according to a bitwise Boolean AND operation,

said layer k intersection list corresponding to the resulting second layer k bitmap vector.

69. A method according to claim 68, wherein, for 1.ltoreq.k<n, the layer k ranks and the layer k bitmap segments associated therewith for at least one thesaurus entry are stored at corresponding addresses in distinct first and second files, and wherein the step /c1/ of computing a layer k intersection list between a layer k input list obtained from an identifier list in the coding scheme and a layer k and level q filtering list, represented by a first layer k bitmap vector, comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said first and second files;

filling the rank table by writing any selected layer k rank into the rank table record associated with the address of said selected layer k rank in said first file; and

for any record of the filled rank table containing a layer k rank and associated with an address in the second file, reading the associated layer k bitmap segment at said address in the second file and combining the read layer k bitmap segment with a segment of the first layer k bitmap vector having a position determined by said layer k rank according to a bitwise Boolean AND operation to determine a segment of the second layer k bitmap vector having a position determined by said layer k rank.

70. A method according to claim 68, further comprising determining a pre-filtering flag for each entry of a level q thesaurus, said pre-filtering flag having a first value when said entry is associated with a reference table row identifier list represented by coding data which do not define any layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of a layer n and level q filtering list, and wherein the step /c/ of determining the intersection list between a level q target list, corresponding to said layer n and level q filtering list, and an identifier list associated with an entry of the level q thesaurus comprises determining said intersection list as being empty if the pre-filtering flag determined for said entry has said first value.

71. A method according to claim 70, wherein, for any level q thesaurus entry associated with a row identifier list represented by coding data which define a layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of the layer n and level q filtering list, the layer n bitmap segment associated with said layer n rank is obtained and said first value is allocated to the pre-filtering flag determined for said entry if the obtained layer n bitmap segment does not represent the position of any integer of said layer n and level q filtering list within said subset.

72. A method according to claim 64, wherein n>1 and in the coding scheme, the coding data representing the position of each integer of an input list within a subset for a coding layer k.ltoreq.n define a layer k bitmap segment in which each bit is associated with a respective integer of the subset to indicate whether said integer belongs to said input list, while a layer k integer rank associated with said layer k bitmap segment represents the position of said subset in the layer k pattern, and wherein the step /c/ of determining the intersection list between a level q target list, corresponding to layer k and level q filtering lists represented by a respective first layer k bitmap vectors for 1.ltoreq.k.ltoreq.n, and an identifier list comprises:

/c1/ initializing a second bitmap vector with logical zeroes;

/c2/ selecting layer n ranks obtained from the coding data representing said identifier list, and setting k=n;

/c3/ for each selected layer k rank:

/c31/ if the selected layer k rank represents the position in the layer k pattern of a subset which includes at least one integer of said layer k and level q filtering list, obtaining the layer k bitmap segment with which the selected layer k rank is associated;

/c32/ for any integer of the layer k and level q filtering list whose position within said subset is represented in said layer k bitmap segment, selecting a respective layer k-1 rank determined from the selected layer k rank and said position represented in said layer k bitmap segment;

/c33/ if k>2, executing step /c3/ with k decremented by one unit; and

/c34/ if k=2, obtaining any layer 1 bitmap segment with which a selected layer 1 rank is associated, and combining the obtained layer 1 bitmap segment with a segment of the first layer 1 bitmap vector having a position determined by said layer 1 rank according to a bitwise Boolean AND operation to determine a segment of the second bitmap vector having a position determined by said layer 1 rank,

said intersection list corresponding to the resulting second bitmap vector.

73. A method according to claim 72, wherein the layer 1 bitmap segments for at least one thesaurus entry are stored in at least one layer 1 file at addresses respectively corresponding to the layer 1 ranks associated therewith, and the step /c/ of determining an intersection list comprises:

providing a rank table in a RAM memory, having records associated with the addresses in said layer 1 file;

filling the rank table by writing any layer 1 rank selected in step /c32/ into the rank table record associated with the address corresponding to the selected layer 1 rank; and

for any record of the filled rank table containing a layer 1 rank and associated with an address in said layer 1 file, reading the associated layer 1 bitmap segment at said address and combining the read layer 1 bitmap segment with a segment of the first layer 1 bitmap vector having a position determined by said layer 1 rank according to a bitwise Boolean AND operation to determine a segment of the second layer 1 bitmap vector having a position determined by said layer 1 rank.

74. A method according to claim 72, further comprising determining a pre-filtering flag for each entry of a level q thesaurus, said pre-filtering flag having a first value when said entry is associated with a reference table row identifier list represented by coding data which do not define any layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of a layer n and level q filtering list, and wherein the step /c/ of determining the intersection list between a level q target list, corresponding to said layer n and level q filtering list, and an identifier list associated with an entry of the level q thesaurus comprises determining said intersection list as being empty if the pre-filtering flag determined for said entry has said first value.

75. A method according to claim 74, wherein, for any level q thesaurus entry associated with a row identifier list represented by coding data which define a layer n rank representing the position in the layer n pattern of a subset which includes at least one integer of the layer n and level q filtering list, the layer n bitmap segment associated with said layer n rank is obtained and said first value is allocated to the pre-filtering flag determined for said entry if the obtained layer n bitmap segment does not represent the position of any integer of said layer n and level q filtering list within said subset.

76. A method according to claim 60, wherein the step of processing the matching row identifier list further comprises writing output data associated with any detected entry of a selected thesaurus into an output table.

77. A method according to claim 76, wherein the output table includes a respective row corresponding to each identifier of the matching row identifier list, and wherein output data associated with a detected entry of a selected thesaurus are written into any row of the output table corresponding to a reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said detected thesaurus entry.

78. A method according to claim 77, wherein each reference table row identifier has a respective row of the output table corresponding thereto, wherein the rows of the output table are initialized with a default value before writing the output data, and wherein the rows of the output table which do not contain the default value are read after writing the output data.

79. A method according to claim 77, wherein the output table is associated with an index file having a respective record for each reference table row identifier, containing either a default value or a pointer designating a respective row of the output table corresponding to said reference table row identifier, wherein the records of the index file are initialized with a default value before writing the output data, and wherein the step of writing output data associated with a detected entry of a first selected thesaurus comprises, for each reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said detected entry of the first selected thesaurus:

allocating an available row of the output table to correspond to said reference table row identifier,

writing output data into the allocated row; and

writing a pointer to the allocated row into the record of the index file provided for said reference table row identifier.

80. A method according to claim 77, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, and wherein output data associated with a detected entry of a thesaurus selected for an attribute associated with a column of the output table are written into said column.

81. A method according to claim 79, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, wherein output data associated with a detected entry of a thesaurus selected for an attribute associated with a column of the output table are written into said column, and wherein the step of writing output data associated with a detected entry of at least one second selected thesaurus comprises, for each reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said detected entry of the second selected thesaurus:

reading the pointer contained in the record of the index file provided for said reference table row identifier; and

writing output data into the row of the output table designated by said pointer.

82. A method according to claim 76, wherein the selected thesaurus being a word thesaurus defined with reference to a partition into subsets each consisting of one word, the output data associated with a detected entry comprise the word for which said detected entry is provided.

83. A method according to claim 76, wherein the selected thesaurus being a macroword thesaurus associated with a prefix length and defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the output data associated with a detected entry comprise the prefix value for which said detected entry is provided.

84. A method according to claim 76, wherein the output data associated with a detected entry comprise an address of said detected entry in the selected thesaurus.

85. A method according to claim 76, wherein the output data associated with a detected entry of a selected thesaurus comprise a numerical value derived from said thesaurus entry.

86. A method according to claim 85, wherein, for a detected entry of at least one selected thesaurus, said numerical value is calculated by applying a mathematical function to a thesaurus value stored in said entry.

87. A method according to claim 85, wherein, for a detected entry of at least one selected thesaurus, said numerical value is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in the output table.

88. A method according to claim 85, wherein the output table includes a respective row corresponding to each identifier of the matching row identifier list, and wherein a numerical value derived from a detected thesaurus entry is written into any row of the output table corresponding to a reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said detected thesaurus entry.

89. A method according to claim 88, wherein the numerical value, derived from a detected entry of a first selected thesaurus and written into any row of the output table corresponding to a reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said entry of the first selected thesaurus, is obtained from a thesaurus value stored in said entry,

and wherein the numerical value, derived from a detected entry of at least one second selected thesaurus and written into a row of the output table corresponding to a reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said entry of the second selected thesaurus, is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in said row of the output table.

90. A method according to claim 88, further comprising calculating an output value from a set of numerical values which have been respectively written into the rows of the output table.

91. A method according to claim 34, wherein an integer range covering the identifiers designating the rows of the reference table is partitioned into a plurality of predetermined portions, wherein at least some of the data representing identifier lists are distributed into a plurality of storage sections respectively associated with said portions, wherein a storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion,

and wherein the step of determining a matching row identifier list is executed separately for the different portions of the reference table row identifier range, by means of the respective storage sections.

92. A method according to claim 91, wherein the step of processing the matching row identifier is at least partially executed separately for the different portions of the reference table row identifier range, by means of the respective storage sections.

93. A method according to claim 91, wherein the thesauruses have a plurality of sections respectively associated with said portions, wherein a section, associated with one of said portions, of a thesaurus associated with an attribute and defined with reference to a partition into subsets has a respective entry for each subset of said partition which includes at least one word assigned to said attribute in a reference table row to which an identifier of said portion is allocated, said entry containing data representing an identifier sub-list including each identifier of said portion allocated to a reference table row having a word of said subset assigned to said attribute, and wherein the step of analyzing the query criteria is at least partially executed separately for the different portions of the reference table row identifier range, by means of the respective thesaurus sections.

94. A method according to claim 91, wherein the separate step executions are carried out in parallel by respective processors for the different portions of the reference table row identifier range.

95. A method according to claim 94, wherein each thesaurus entry has a plurality of fields respectively associated with said portions, for containing data for retrieving respective identifier sub-lists from the storage sections, wherein the step of analyzing the query criteria is executed centrally for all the portions of the reference table row identifier range, and wherein the relevant thesaurus entries used by a processor executing the step of determining a matching row identifier list by means of a storage section are designated by the data for retrieving identifier sub-lists from said storage section.

96. A method according to claim 95, wherein the step of analyzing the query criteria is executed by a query server connected to said processors through a communication network.

97. A method according to claim 96, wherein a list update server is connected, through the communication network, to a plurality of storage units respectively coupled to said processors, the list update server controlling the storage units to maintain the storage sections.

98. A method according to claim 34, wherein said reference table is a virtual table which is not stored.

99. A database system, for managing information from a collection of data, wherein a plurality of row identifiers are defined to designate respective rows of a reference table having columns respectively associated with data attributes, said rows containing groups of related words assigned to said attributes in the collection of data, the database system comprising:

means for storing a plurality of thesauruses respectively associated with attributes of said group, wherein each thesaurus associated with an attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said attribute and has a respective entry for each subset including at least one word assigned to said attribute in the collection of data; and

means for storing data representing identifier lists respectively associated with the thesaurus entries, wherein the identifier list associated with a entry, relating to a subset, of a thesaurus associated with an attribute includes any row identifier designating a row of the reference table having a word of said subset assigned to said attribute,

and wherein the thesaurus include at least one macroword thesaurus associated with an attribute and with a prefix length shorter than a word length of said attribute, whereby said macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length.

100. A database system according to claim 99, wherein said reference table is a virtual table which is not stored.

101. A database system according to claim 100, further comprising means for storing a link table having a plurality of rows respectively associated with the rows of the reference table and a plurality of columns respectively associated with attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the associated reference table row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or a link value for retrieving at least one stored word of the collection of data represented in the associated reference table row and assigned to an attribute of the sub-group associated with said one of the columns.

102. A database system according to claim 101, further comprising means for storing a respective data table for each of the attribute sub-groups, and wherein each link value contained in a column of the link table associated with an attribute sub-group comprises data for identifying a row of the data table stored for said sub-group.

103. A database system according to claim 99, further comprising query processing means including:

means for analyzing query criteria to determine a combination involving thesaurus entries relevant to a query;

means for determining a matching reference table row identifier list based on said combination and on the stored data representing the reference table row identifier lists associated with said relevant thesaurus entries; and

output means for processing said matching row identifier list to output a response.

104. A database system according to claim 103, wherein at least one attribute has a plurality of macroword thesauruses, associated with different prefix lengths.

105. A database system according to claim 103, wherein the means for analyzing the query criteria comprises, for at least one attribute referred to in said criteria:

means for selecting at least one range of words defined for said attribute in the query criteria; and

mapping means for mapping the words of the selected range which are assigned to said attribute in the collection of data with one or more subsets, and for retaining the thesaurus entry for each of said one or more subset being retained as a relevant entry for the selected range,

and wherein the means for determining the matching row identifier list comprise means for merging respective portions of the identifier lists represented by the data of the relevant thesaurus entries retained for said selected range.

106. A database system according to claim 105, wherein the mapping means are arranged to retain a minimum number of relevant thesaurus entries for each selected range.

107. A database system according to claim 105, wherein each thesaurus associated with an attribute is defined with reference to a partition such that each subset consists of one word or of consecutive words of the set of words which can be assigned to said attribute, the entries of said thesaurus being sorted based on the words assigned to said attribute, and wherein the means for analyzing the query criteria comprises dichotomy search means in at least one thesaurus for identifying relevant thesaurus entries.

108. A database system according to claim 105, wherein the means for analyzing the query criteria comprise means for determining said combination involving relevant thesaurus entries as a tree having at least one leaf node, each leaf node corresponding to at least one relevant thesaurus entry retained for a respective attribute.

109. A database system according to claim 108, wherein said tree has a plurality of nodes including said at least one leaf node and at least one operator node, each operator node representing a Boolean operator applied to at least one partial criterion represented by another node of said tree, one of the operator nodes being a root node representing all the query criteria.

110. A database system according to claim 109, wherein the means for determining the matching reference table row identifier list comprise means for obtaining a respective identifier list for each node of said tree, whereby the identifier list obtained for each leaf node corresponding to at least one relevant thesaurus entry is the merger of respective portions of the identifier lists associated with said at least one relevant thesaurus entry, and the identifier list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the identifier lists obtained for the node representing said at least one partial criterion, said matching row identifier list being determined as the identifier list obtained for the root node.

111. A database system according to claim 110, wherein each of said obtained identifier lists is produced in the form of a bitmap vector consisting of bits assigned to respective reference table rows to indicate whether the identifiers allocated to said reference table rows belong to said obtained list.

112. A database system according to claim 110, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a thesaurus entry, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

and wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

113. A database system according to claim 112, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

114. A database system according to claim 113, wherein said number of integers is a whole power of 2 for each layer.

115. A database system according to claim 112, wherein the means for determining the matching row identifier list comprise means for determining a layer n integer list for each node of said tree, whereby the layer n integer list determined for a leaf node consists of a layer n input list associated, in the coding scheme, with the merger of the identifier lists represented in the relevant thesaurus entries to which said leaf node corresponds, and whereby the layer n integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer n integer lists determined for the nodes representing said at least one partial criterion, and wherein a layer n result list is determined as the layer n integer list obtained for the root node.

116. A database system according to claim 115, wherein the nodes of said tree further include at least one preset node for which a reference table row identifier list has been determined in advance, said reference table row identifier list being subjected to the coding scheme to provide a layer n input list which is determined as said layer n integer list for said preset node.

117. A database system according to claim 115, wherein n>1 and the means for determining the matching row identifier list further comprise, for k decreasing from n-1 to 1, means for determining a layer k integer list for each node of said tree, whereby the layer k integer list determined for a leaf node consists of any integer of a layer k input list, associated in the coding scheme with the identifier list represented in a relevant thesaurus entry to which said leaf node corresponds, which belongs to a layer k subset whose position is represented in the layer k+1 result list, and whereby the layer k integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer k integer lists determined for the nodes representing said at least one partial criterion, wherein a layer k result list is determined as the layer k integer list obtained for the root node, and wherein said matching row identifier list corresponds to the determined layer 1 result list.

118. A database system according to claim 117, wherein the nodes of said tree further include at least one preset node for which a reference table row identifier list has been determined in advance, said reference table row identifier list being subjected to the coding scheme to provide a layer k input list which is determined as said layer k integer list for said preset node.

119. A database system according to claim 103, wherein the output means comprise means for selecting a thesaurus associated with an attribute specified in the query and means for detecting entries of the selected thesaurus with which identifier lists having a non-empty intersection with the matching row identifier list are associated.

120. A database system according to claim 119, wherein the output means further comprise means for writing output data associated with any detected entry of a selected thesaurus into an output table.

121. A database system according to claim 120, wherein the output table includes a respective row corresponding to each identifier of the matching row identifier list, and wherein said means for writing output data are arranged to write output data associated with a detected entry of a selected thesaurus into any row of the output table corresponding to a reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said detected thesaurus entry.

122. A database system according to claim 121, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, and wherein said means for writing output data are arranged to write into a column of the output table output data associated with a detected entry of a thesaurus selected for an attribute associated with said column.

123. A database system according to claim 120, wherein the selected thesaurus being a word thesaurus defined with reference to a partition into subsets each consisting of one word, the output data associated with a detected entry comprise the word for which said detected entry is provided.

124. A database system according to claim 120, wherein the selected thesaurus being a macroword thesaurus associated with a prefix length and defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the output data associated with a detected entry comprise the prefix value for which said detected entry is provided.

125. A database system according to claim 120, wherein the output data associated with a detected entry comprise an address of said detected entry in the selected thesaurus.

126. A database system according to claim 120, wherein the output data associated with a detected entry of a selected thesaurus comprise a numerical value derived from said thesaurus entry.

127. A database system according to claim 126, further comprising means for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a thesaurus value stored in said entry.

128. A database system according to claim 126, further comprising means for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in the output table.

129. A database system according to claim 126, wherein the output table includes a respective row corresponding to each identifier of the matching reference table row identifier list, and wherein said means for writing output data are arranged to write a numerical value derived from a detected thesaurus entry into any row of the output table corresponding to a reference table row identifier belonging to both the matching reference table row identifier list and the identifier list associated with said detected thesaurus entry.

130. A database system according to claim 129, wherein the numerical value, derived from a detected entry of a first selected thesaurus and written into any row of the output table corresponding to a reference table row identifier belonging to both the matching reference table row identifier list and the identifier list associated with said entry of the first selected thesaurus, is obtained from a thesaurus value stored in said entry,

and wherein the numerical value, derived from a detected entry of at least one second selected thesaurus and written into a row of the output table corresponding to a reference table row identifier belonging to both the matching reference table row identifier list and the identifier list associated with said entry of the second selected thesaurus, is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in said row of the output table.

131. A database system according to claim 129, further comprising means for calculating an output value from a set of numerical values which have been respectively written into the rows of the output table.

132. A database system according to claim 103, wherein the means for storing data representing identifier lists comprise a plurality of storage sections respectively associated with distinct portions of an integer range covering the identifiers allocated to the reference table rows according to a predetermined partition, wherein each storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion,

and wherein the means for determining the matching reference table row identifier list comprise means for determining respective matching identifier sub-lists included in the different portions of the reference table row identifier range, by means of the respective storage sections.

133. A database system according to claim 132, comprising a processor to determine said matching identifier sub-lists sequentially.

134. A database system according to claim 132, comprising a plurality of processors respectively associated to the different portions of the reference table row identifier range, to determine said matching identifier sub-lists in parallel.

135. A database system according to claim 134, wherein the means for analyzing query criteria comprise a query server connected to said processors through a communication network.

136. A database system according to claim 135, further comprising a list update server connected, through the communication network, to a plurality of storage units respectively coupled to said processors, the list update server controlling the storage units to maintain the storage sections.

137. A computer program product for managing information from a collection of data, wherein a plurality of row identifiers are defined to designate respective rows of a reference table having columns respectively associated with data attributes, said rows containing groups of related words assigned to said attributes in the collection of data, the computer program product comprising:

instructions for storing a plurality of thesauruses respectively associated with attributes of said group, wherein each thesaurus associated with an attribute is defined with reference to a partition into subsets of a set of words which can be assigned to said attribute and has a respective entry for each subset including at least one word assigned to said attribute in the collection of data; and

instructions for storing data representing identifier lists respectively associated with the thesaurus entries, wherein the identifier list associated with a entry, relating to a subset, of a thesaurus associated with an attribute includes any row identifier designating a row of the reference table having a word of said subset assigned to said attribute,

and wherein the thesaurus include at least one macroword thesaurus associated with an attribute and with a prefix length shorter than a word length of said attribute, whereby said macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length.

138. A computer program product according to claim 137, wherein said reference table is a virtual table which is not stored.

139. A computer program product according to claim 138, further comprising instructions for storing a link table having a plurality of rows respectively associated with the rows of the reference table and a plurality of columns respectively associated with attribute sub-groups, wherein each row of the link table contains, in each one of the columns, either a value indicating that each attribute value represented in the associated reference table row and assigned to an attribute of the sub-group associated with said one of the columns is a default value or a link value for retrieving at least one stored word of the collection of data represented in the associated reference table row and assigned to an attribute of the sub-group associated with said one of the columns.

140. A computer program product according to claim 139, further comprising instructions for storing a respective data table for each of the attribute sub-groups, and wherein each link value contained in a column of the link table associated with an attribute sub-group comprises data for identifying a row of the data table stored for said sub-group.

141. A computer program product according to claim 137, further comprising instructions for processing a query including:

instructions for analyzing query criteria to determine a combination involving thesaurus entries relevant to a query;

instructions for determining a matching reference table row identifier list based on said combination and on the stored data representing the reference table row identifier lists associated with said relevant thesaurus entries; and

instructions for outputting a response by processing said matching row identifier list.

142. A computer program product according to claim 141, wherein the instructions for analyzing the query criteria comprises, for at least one attribute referred to in said criteria:

instructions for selecting at least one range of words defined for said attribute in the query criteria; and

instructions for mapping the words of the selected range which are assigned to said attribute in the collection of data with one or more subsets, and for retaining the thesaurus entry for each of said one or more subset being retained as a relevant entry for the selected range,

and wherein the instructions for determining the matching row identifier list comprises instructions for merging respective portions of the identifier lists represented by the data of the relevant thesaurus entries retained for said selected range.

143. A computer program product according to claim 142, wherein the instructions for mapping the words of the selected range are arranged to retain a minimum number of relevant thesaurus entries for each selected range.

144. A computer program product according to claim 142, wherein each thesaurus associated with an attribute is defined with reference to a partition such that each subset consists of one word or of consecutive words of the set of words which can be assigned to said attribute, the entries of said thesaurus being sorted based on the words assigned to said attribute, and wherein the instructions for analyzing the query criteria comprise instructions for performing a dichotomy search in at least one thesaurus for identifying relevant thesaurus entries.

145. A computer program product according to claim 142, wherein the instructions for analyzing the query criteria comprise instructions for determining said combination involving relevant thesaurus entries as a tree having at least one leaf node, each leaf node corresponding to at least one relevant thesaurus entry retained for a respective attribute.

146. A computer program product according to claim 145, wherein said tree has a plurality of nodes including said at least one leaf node and at least one operator node, each operator node representing a Boolean operator applied to at least one partial criterion represented by another node of said tree, one of the operator nodes being a root node representing all the query criteria.

147. A computer program product according to claim 146, wherein the instructions for determining the matching reference table row identifier list comprise instructions for obtaining a respective identifier list for each node of said tree, whereby the identifier list obtained for each leaf node corresponding to at least one relevant thesaurus entry is the merger of respective portions of the identifier lists associated with said at least one relevant thesaurus entry, and the identifier list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the identifier lists obtained for the node representing said at least one partial criterion, said matching row identifier list being determined as the identifier list obtained for the root node.

148. A computer program product according to claim 147, wherein each of said obtained identifier lists is produced in the form of a bitmap vector consisting of bits assigned to respective reference table rows to indicate whether the identifiers allocated to said reference table rows belong to said obtained list.

149. A computer program product according to claim 147, wherein a coding scheme comprising n successive coding layers is used to provide coding data representing the identifier list associated with a thesaurus entry, n being a number at least equal to 1, each layer having a predetermined pattern for dividing a range covering integers of an input list of said layer into subsets, said identifier list being the input list of the first layer for said thesaurus entry, wherein for any layer other than the last layer, an integer list representing the position, in the pattern of said layer, of each subset containing at least one integer of the input list forms the input list for the next layer,

and wherein the coding data comprise, for each layer and each subset containing at least one integer of the input list, data representing the position of each integer of the input list within said subset and, at least if said layer is the last layer, data representing the position of said subset in the pattern of said layer.

150. A computer program product according to claim 149, wherein the pattern of each layer is such that the integer subsets are consecutive intervals consisting of the same number of integers.

151. A computer program product according to claim 150, wherein said number of integers is a whole power of 2 for each layer.

152. A computer program product according to claim 149, wherein the instructions for determining the matching row identifier list comprise instructions for determining a layer n integer list for each node of said tree, whereby the layer n integer list determined for a leaf node consists of a layer n input list associated, in the coding scheme, with the merger of the identifier lists represented in the relevant thesaurus entries to which said leaf node corresponds, and whereby the layer n integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer n integer lists determined for the nodes representing said at least one partial criterion, and wherein a layer n result list is determined as the layer n integer list obtained for the root node.

153. A computer program product according to claim 152, wherein the nodes of said tree further include at least one preset node for which a reference table row identifier list has been determined in advance, said reference table row identifier list being subjected to the coding scheme to provide a layer n input list which is determined as said layer n integer list for said preset node.

154. A computer program product according to claim 152, wherein n>1 and the instructions for determining the matching row identifier list further comprise, for k decreasing from n-1 to 1, instructions for determining a layer k integer list for each node of said tree, whereby the layer k integer list determined for a leaf node consists of any integer of a layer k input list, associated in the coding scheme with the identifier list represented in a relevant thesaurus entry to which said leaf node corresponds, which belongs to a layer k subset whose position is represented in the layer k+1 result list, and whereby the layer k integer list obtained for each operator node representing a Boolean operator applied to at least one partial criterion is obtained by applying said Boolean operator to the layer k integer lists determined for the nodes representing said at least one partial criterion, and instructions for determining a layer k result list as the layer k integer list obtained for the root node, and wherein said matching row identifier list corresponds to the determined layer 1 result list.

155. A computer program product according to claim 141, wherein said instructions for outputting a response comprise instructions for selecting a thesaurus associated with an attribute specified in the query and instructions for detecting entries of the selected thesaurus with which identifier lists having a non-empty intersection with the matching row identifier list are associated.

156. A computer program product according to claim 155, wherein said instructions for outputting a response further comprise instructions for writing output data associated with any detected entry of a selected thesaurus into an output table.

157. A computer program product according to claim 156, wherein the output table includes a respective row corresponding to each identifier of the matching row identifier list, and wherein said instructions for writing output data are arranged to control the writing of output data associated with a detected entry of a selected thesaurus into any row of the output table corresponding to a reference table row identifier belonging to both the matching row identifier list and the identifier list associated with said detected thesaurus entry.

158. A computer program product according to claim 157, wherein the output table has a plurality of columns each associated with a respective attribute for which a thesaurus is selected, and wherein said instructions for writing output data are arranged to control the writing into a column of the output table of output data associated with a detected entry of a thesaurus selected for an attribute associated with said column.

159. A computer program product according to claim 156, wherein the selected thesaurus being a word thesaurus defined with reference to a partition into subsets each consisting of one word, the output data associated with a detected entry comprise the word for which said detected entry is provided.

160. A computer program product according to claim 156, wherein the selected thesaurus being a macroword thesaurus associated with a prefix length and defined with reference to a partition into subsets each consisting of words beginning by a common prefix having said prefix length, the output data associated with a detected entry comprise the prefix value for which said detected entry is provided.

161. A computer program product according to claim 156, wherein the output data associated with a detected entry comprise an address of said detected entry in the selected thesaurus.

162. A computer program product according to claim 156, wherein the output data associated with a detected entry of a selected thesaurus comprise a numerical value derived from said thesaurus entry.

163. A computer program product according to claim 162, further comprising instructions for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a thesaurus value stored in said entry.

164. A computer program product according to claim 162, further comprising instructions for calculating said numerical value for a detected entry of at least one selected thesaurus by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in the output table.

165. A computer program product according to claim 162, wherein the output table includes a respective row corresponding to each identifier of the matching reference table row identifier list, and wherein said instructions for writing output data are arranged to control the writing of a numerical value derived from a detected thesaurus entry into any row of the output table corresponding to a reference table row identifier belonging to both the matching reference table row identifier list and the identifier list associated with said detected thesaurus entry.

166. A computer program product according to claim 165, wherein the numerical value, derived from a detected entry of a first selected thesaurus and written into any row of the output table corresponding to a reference table row identifier belonging to both the matching reference table row identifier list and the identifier list associated with said entry of the first selected thesaurus, is obtained from a thesaurus value stored in said entry,

and wherein the numerical value, derived from a detected entry of at least one second selected thesaurus and written into a row of the output table corresponding to a reference table row identifier belonging to both the matching reference table row identifier list and the identifier list associated with said entry of the second selected thesaurus, is calculated by applying a mathematical function to a plurality of values including a thesaurus value stored in said entry and at least one value already present in said row of the output table.

167. A computer program product according to claim 165, further comprising instructions for calculating an output value from a set of numerical values which have been respectively written into the rows of the output table.

168. A computer program product according to claim 141, wherein the instructions for storing data representing identifier lists are arranged to supervise a plurality of storage sections respectively associated with distinct portions of an integer range covering the identifiers allocated to the reference table rows according to a predetermined partition, wherein each storage section associated with one of said portions contains data representing identifier sub-lists consisting of identifiers of said portion,

and wherein the instructions for determining the matching reference table row identifier list comprise instructions for determining respective matching identifier sub-lists included in the different portions of the reference table row identifier range, by means of the respective storage sections.

169. A computer program product according to claim 168, wherein said matching identifier sub-lists are determined sequentially.

170. A computer program product according to claim 168, wherein said matching identifier sub-lists are determined in parallel.


Description

BACKGROUND OF THE INVENTION

The present invention relates to relational database management systems (RDBMS), and more particularly to computerized systems for storing and accessing large amounts of data.

In a non-limiting manner, the invention is applicable to "data warehouses". On-line transaction processing (OLTP) systems, such as for bank teller transactions and airline reservations, are optimized for finding a record associated with a specific key, e.g. finding the information about employee 123124. By contrast, data warehouses are optimized for finding sets of records very quickly. The reason is that typical queries are of the form: "find all sales by region and quarter" or "find stores that sell the greatest volume of sportswear per month" or "select the top 5 stores for each product category for the last year". Such queries must typically access large sets of rows in data tables. The query processing challenge is to process these queries without doing a linear scan of all or most of the database.

Five main approaches have been proposed to attack this problem: (i) multidimensional arrays; (ii) special indexes; (iii) table caching; (iv) optimized foreign key joins; and (v) approximation.

(i) Multidimensional Arrays (i.e. Matrices)

This strategy consists of implementing the data warehouse as a multidimensional array or matrix. Examples may be found in U.S. Pat. Nos. 5,359,724 and 5,864,857. Each dimension corresponds to an attribute of the data. For example, a sales table can be viewed as a matrix with coordinates: store location, product type, customer id, and so on. A particular sale can be identified by specifying all of these attributes. The strategy works well for small databases or very dense ones. By dense, we mean that the Cartesian product of possible values should all be meaningful, e.g., every customer is likely to buy every product from every store. Since this is rarely true, this scheme must be modified to deal with sparse values. This can be done by defining a notion of sparse attributes and dense ones. So, for example, it might be that every store carries every product (a dense relationship that can be stored in a matrix), but only some of these combinations are valid for any given customer. So, a conventional index would be used whenever customer sales are involved, but a dense one for queries involving store-wide or product-wide sales.

(ii) Special Indexes

Bitmap indexes are an index structure tailored to data warehouses (see, e.g. U.S. Pat. No. 5,903,888). These indexes have already been used in some commercial products to speed up query processing. In its simplest form, a bitmap index on an attribute consists of one vector of bits (i.e. bitmap) per attribute value, where the size of each bitmap is equal to the number of records in the indexed relation. For example, if the attribute is day-of-week, then there would be seven bitmap vectors for that attribute, one for each day. The bitmap vector corresponding to Monday would have a 1 at position i if record i contains "Monday" in the day-of-week attribute. This single value-based approach is called a Value-List index. Other techniques (e.g. U.S. Pat. No. 5,761,652) associate bit vectors with ranges of values, so there could, for a salary attribute, be a vector for the range 0 to 20,000 Euros, 20,000.01 to 35,000 Euros, and so on. Still others associate each bit vector with a bit value (a 1 or a 0) in a given position. So, if the attribute holds n bit numbers, then there would be 2n bit vectors (position 1, bit value 1; position 1, bit value 0; position 2 bit value 1; . . . ).

The benefit of bit vectors is that it is easy to use multiple bit vectors to answer a single query. Consider a query on several predicates, each of which is indexed. Most conventional database management systems would use just one of the indexes (the one that is most "selective" so returns the fewest rows), though some systems might attempt to intersect the record identifiers of multiple indexes.

Bitmaps work better, because they are more compact and intersecting several bitmaps is much faster than intersecting several collections of record identifiers. In the best case, the improvement is proportional to the word size of the machine. For example, suppose the word size is 32 bits. Then two bit vectors can be intersected 32 bits at a time. Each set of 32 bits corresponds to 32 record identifiers being intersected. That best case occurs when each predicate is unselective (i.e. many records match each predicate value), but all the predicates together are quite selective. Consider for example the query: "Find people who have brown hair, glasses, ages between 30 and 40, blue eyes, work in the computer industry, live in California, . . . ".

So, matrices are best when sets of predicates are dense (all, or nearly all, values in the Cartesian product are possible), bitmaps are best when predicates are neither dense nor individually selective. An intermediate approach (when there is insufficient density for matrices but many values in the Cartesian product are present) is to use multidimensional indexes. Multidimensional indexes such as quadtrees, R-trees and their successors are implemented as variable sized grids on a multidimensional space. The grids are of variable sizes because the population of points differs in different places in a hyperspace. For intuition, consider a map of equi-population rectangles of France. The rectangles would be far more dense in Paris than in the alps. Indexes like this work well for spatial data (where they are used to find the points contained in latitude-longitude quadrants). This alternative is little explored in the commercial arena except for geographical queries, however, because these schemes do not scale well with increasing dimensionality and commercial systems typically have far more than three dimensions.

(iii) Table Caching

If one doesn't have the luxury to design new indexes on top of a database system (because one is not the implementer of that system) one can pre-compute a large number of anticipated aggregate queries and put them in tables. For example, if a large retailer frequently asks queries that sum the total sales across multiple stores or multiple products, one may store such information in special tables. The main cost of such a strategy is maintaining these tables in the face of updates. (Disk space is no longer a major factor.) In the example, every sale of item I at store S would have to update the total product sales table for I and the total store sales table for S. So, this strategy is worthwhile if there are few updates between queries. The strategy is not worthwhile if there are many.

(iv) Optimized Foreign Key Joins

Most queries in multidimensional tables entail joins between a central "fact table" (e.g. sales detail) and a set of dimension tables (e.g. store description, product description, customer description). These are known as "foreign key joins" since the customer identifier in the sales table, for example, is a key of the customer description table. (A key is a value belonging to an attribute such that only one record has that value in the attribute.) One way to accelerate these joins is to create a linkage between fact table records and dimension records. This can be done in three basic ways

(a) create an index that holds fact table record identifiers and dimension table record identifiers;

(b) create bidirectional pointers between fact table records and dimension table rows--this is what "object-oriented" databases do;

(c) replace the customer record identifiers in the fact table by offsets into the dimension tables.

Choice (a) is the most independent of changes in the physical organization of the tables and therefore is best for heavily updated systems, because changes to the dimension table can be reflected in the index to that table alone. Choice (b) is the least flexible to physical reorganization, because reorganizing a dimension table would entail updating the fact table. Choice (c) is a compromise of the two in that certain physical reorganizations can be done to the dimension tables (e.g. changing its position on disk) without changing the fact table. Examples of join optimization may be found in U.S. Pat. Nos. 5,548,754, 5,671,403, 5,724,568, 5,752,017, 5,761,657 and 5,822,747.

(v) Approximating the Result

Since most people use data warehouses to get strategic aggregate information, many would be happy with a fast approximation as long as it has error bounds. Typical work in this area is illustrated by U.S. Pat. No. 5,870,752, which shows how to estimate aggregate results in data warehouses while giving error bounds. The basic problem is that sampling all tables and then doing aggregates does not work in general. For example, if one wants to join R and S on their keys, then taking a 1/10 sample of each will give a size that is 1/100 of the size of the real join if the samples are random. So, one must be more clever. The idea is to take an initial set of tables R, S, T, . . . that are linked by foreign key joins. Suppose for example that R is the fact table and the others are dimension tables. Take a sample of R and then perform all these foreign key joins based on the sample giving R', S', T', . . . . Now, if a query involves R, S, T and includes the foreign key links among these, then the query can be done with great accuracy on R', S', T'. The error can be estimated by considering the result obtained by several partitions of R' and looking at their variance.

An object of the present invention is to propose an alternative method of organizing a database management system, which enables an efficient query processing.

SUMMARY OF THE INVENTION

The invention proposes a method of organizing information in a database system. A plurality of row identifiers are defined to designate respective rows of a reference table having columns respectively associated with data attributes. These rows contain groups of related words assigned to the attributes in a collection of data. The method comprises storing at least one macroword thesaurus associated with one of the attributes and with a prefix length shorter than a word length of said attribute.

The macroword thesaurus has a respective entry for each prefix value having said prefix length and matching a corresponding prefix of at least one word assigned to said data attribute in the collection of data.

Data representing identifier lists respectively associated with the macroword thesaurus entries are also stored. The identifier list associated with an entry, relating to a prefix value, of a macroword thesaurus associated with an attribute includes any row identifier designating a row of the reference table having a word whose corresponding prefix matches said prefix value in the column associated with said attribute.

Another aspect of the invention relates to a method of processing queries in a database system in which the information is organized as indicated hereabove. A plurality of thesauruses each associated with a respective attribute and data representing reference table row identifier lists respectively associated with entries of these thesauruses are stored. Each thesaurus associated with one attribute is defined with reference to a partition into subsets of a set of words which can be assigned to that attribute and has a respective entry for each subset including at least one word assigned to the attribute in the collection of data. The reference table row identifier list associated with such thesaurus entry includes any identifier allocated to a row of the reference table having a word of the subset assigned to the attribute.

The thesaurus includes at least one macroword thesaurus associated with an attribute and with a prefix length shorter than a word length of that attribute. This macroword thesaurus is defined with reference to a partition into subsets each consisting of words beginning by a common prefix having the corresponding prefix length. The query processing method comprises the steps of:

analyzing query criteria to determine a combination involving thesaurus entries relevant to the query;

determining a matching reference table row identifier list based on such combination and on the stored data representing the reference table row identifier lists associated with the relevant thesaurus entries; and

processing matching row identifier list to output a response.

The invention further proposes a database system implementing methods as outlined above, and computer program products having instructions for carrying out such method.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1-3 show an example of data structure as typically used in a conventional relational database system.

FIG. 4 is a diagram representing a data table tree in the example of FIGS. 1-3.

FIGS. 5-7 are diagrams showing respective data graphs constructed with the tree of FIG. 4 and the data of FIGS. 1-3.

FIG. 8 is a flat file representation of the data tables of FIGS. 1-3.

FIG. 9 shows a link table as used in an embodiment of the invention.

FIGS. 10A-H show the contents of thesauruses corresponding to the data tables of FIGS. 1-3.

FIGS. 11A-14A, 11G-14G and 11H-14H show other representations of the thesauruses of FIGS. 10A, 10G and 10H, respectively.

FIGS. 15-16 illustrate the data stored in a data container in connection with the thesauruses of FIGS. 14A, 14G and 14H.

FIG. 17 shows another possible structure of the thesaurus of FIGS. 10A-14A.

FIG. 18 is a block diagram of a computer system suitable for implementing the invention.

FIG. 19 is a flow chart showing a data graph creation procedure in accordance with an embodiment the invention.

FIG. 20 is a flow chart showing a procedure applicable in stage 124 of FIG. 19.

FIGS. 21 and 22 are flow charts showing procedures applicable in step 136 of FIG. 20.

FIGS. 23 and 24 are flow charts showing another procedure applicable in step 136 of FIG. 20 in two successive coding layers.

FIGS. 25-32 are tables showing a way of storing thesauruses constructed from the example of FIGS. 1-3.

FIG. 33 is a flow chart showing an alternative way of executing steps 135 and 136 of FIG. 20 when the thesauruses are stored as shown in FIG. 17.

FIGS. 34A and 34B are tables showing an alternative embodiment of the tables of FIGS. 31-32.

FIGS. 34C and 34D are another representation of the tables of FIGS. 34A and 34B.

FIG. 35 is a flow chart showing a procedure applicable in the management of tables of the type shown in FIGS. 34A and 34B.

FIG. 36 is a general flow chart of a query processing procedure in accordance with an embodiment of the invention.

FIG. 37 is a diagram showing an example of query tree referring to the example of FIGS. 1-3.

FIG. 38 is another diagram showing an expanded query tree obtained by analyzing the query tree of FIG. 37.

FIG. 39 is a flow chart showing a procedure of analyzing the query tree.

FIG. 40, which is obtained by placing FIG. 40A above FIG. 40B, is a flow chart of a recursive function referred in the procedure of FIG. 39.

FIG. 41 is the flow chart procedure for identifying matching data graphs based on an expanded query tree as illustrated in FIG. 38.

FIG. 42 is a flow chart of a recursive function FNODE called to in the procedure of FIG. 41.

FIGS. 43-45 are flow charts illustrating procedures executed in steps 262, 264 and 265 of FIG. 42, respectively.

FIG. 46 is a flow chart showing an alternative embodiment of the procedure of step 265 of FIG. 42.

FIG. 47 is a flow chart showing another alternative embodiment of the procedure of step 265 of FIG. 42, when the thesauruses are stored as illustrated in FIGS. 34A and 34B.

FIG. 48 is a flow chart of a recursive function FILT called in the procedure of FIG. 47.

FIG. 49 is a flow chart showing another alternative embodiment of the procedure of step 265 of FIG. 42, when the thesauruses are stored as illustrated in FIG. 17.

FIG. 50 is a flow chart of a variant of a leaf processing used in the function of FIG. 42.

FIG. 51 is a flow chart showing a procedure applicable for scanning the thesaurus relating to a given attribute in order to retrieve the attribute values relevant to a database query.

FIG. 52 is a flow chart of a function FINTER referred to in the procedure of FIG. 51.

FIGS. 53-55 are flow charts showing procedures executed in steps 355, 357 and 358 of FIG. 52, respectively.

FIG. 56 is a flow chart showing an alternative procedure applicable in step 358 of FIG. 52, when the thesauruses are stored as illustrated in FIGS. 33-34.

FIG. 57 is a flow chart of a recursive function FFILT called in the procedure of FIG. 56.

FIGS. 58-61 show tables which may be stored to cooperate with the tables of FIGS. 25-34.

FIG. 62 is a flow chart showing a pre-filtering procedure which may be used prior to a thesaurus scanning similar to that of FIG. 51.

FIG. 63 is a flow chart showing a part of a thesaurus scanning procedure according to FIG. 51, adapted to take into account a pre-filtering according to FIG. 62.

FIG. 64 is a flow chart showing an alternative procedure applicable in step 358 of FIG. 52, when the thesauruses are stored as illustrated in FIG. 17.

FIG. 65 is a flow chart showing a procedure applicable in step 335 of FIG. 51.

FIGS. 66 and 67 show the contents of an exemplary output table used to provide a query response.

FIG. 68 is a diagram illustrating another possible structure of the output table.

FIGS. 69 and 70 are flow charts showing procedures applicable in step 335 of FIG. 51 to construct an output table of the type shown in FIG. 68.

FIG. 71 is a flow chart showing a procedure applicable in step 335 of FIG. 51 to perform computations in a database system by means of a computation table.

FIG. 72 is a block diagram of another computer system suitable for implementing the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

Virtual Data Graphs

FIGS. 1-3 illustrate a collection of data which can be stored in a computer memory coupled with a processor arranged for running relational database management programs. This example will be referred to in the following description to give an illustration of the principles and embodiments of the invention where appropriate.

FIGS. 1-3 show a conventional type of data organization in a database system. The illustrated system handles data relevant to a hypothetical insurance company which manages policies for its clients. The data are organized in three tables relating to the clients, policies and accidents as shown in FIGS. 1-3, respectively.

From a logical point of view, each data table consists of a two-dimensional matrix, with rows corresponding to respective records in the table and columns corresponding to respective data attributes of the records or structural features of the database (the latter type of column typically contains either local record identification keys or foreign keys designating records in a target table).

It will be appreciated, however, that for large databases the actual storage of the data in a memory medium, e.g. a magnetic disc, is frequently performed otherwise: each row typically has a memory address where the corresponding attribute values or keys are stored in the order of the columns and separated by predetermined symbols such as the encoded character ".backslash.".

In our simplified example given to facilitate the explanation of the proposed data structures, the tables are of modest size. In practice, there are usually more tables and more attributes (columns) per table (notwithstanding, one or more tables could also have a single column). Moreover, the data tables generally include much more records, up to thousands or millions of rows depending on the application.

In that example, the database a group of seven attributes distributed into three sub-groups corresponding to the three data tables. Each attribute has a column in the data table corresponding to its sub-group. The client data table (FIG. 1) has three attributes, i.e. client name, birth year and gender. The policy data table of FIG. 2 has two attributes, i.e. policy type ("car" or "house") and policy effect date, and a link column to the client table. The accident data table of FIG. 3 has two attributes, i.e. date of accident and amount of damages incurred in a given currency, and a link column to the policy table.

In a given data table, each record/row has a unique identifier, referred to as a row-ID. This identifier corresponds to the memory address where the record is stored, usually through a conversion table. It may be stored as an identification key in a column of the data table for the purposes of unique row identification, but this is not compulsory. In our example, the row-ID's are integer indexes starting from zero for each data table, and they are not stored explicitly in a column of the table.

Some of the tables are linked together, as indicated in the last column of FIGS. 2 and 3. Two tables are directly linked if one of them (source table) has a link column provided for containing foreign keys designating records of the other one (target table).

Those foreign keys, hereafter called links, reflect the hierarchy and organization of the data handled in the relational database system. In our example, each accident dealt with by the insurance company is related to a certain policy managed by the company, hence the policy links of FIG. 3. Each policy is for a particular client of the company, hence the client links of FIG. 2. It will be noted that some links may be optional. For example, some accidents may involve third parties and if there is a separate table for third parties, then each record of the accident table may have a link to the third party table.

Each link typically consists of a row-ID in the target data table. For instance, the accident stored as row-ID=0 in the accident table of FIG. 3, which took place on Oct. 3, 1998 for an amount of 1,000 has a policy link pointing to the policy stored as row-ID=1 in the policy table of FIG. 2, i.e. it relates to a car policy subscribed on Sep. 9, 1998 by the client with row-ID=1 in the client table of FIG. 1, i.e. Andre, a man born in 1976. If the target table has other forms of record identification keys, for example compound keys, a link may also designate a target record as identified by such a key.

The construction of the links obeys a number of rules. In particular, the linked data tables have a directed acyclic graph structure such as a hierarchical tree organization illustrated in FIG. 4. A root table is defined as a data table for which no other data table has links pointing to its rows, such as the accident table of FIG. 3. In other words, a root table does not constitute a target table. Likewise, a leaf table is defined as a data table with no link column, such as the client table of FIG. 1. In other words, a leaf table does not constitute a source table. FIG. 4 shows only one root table, but the tree structure of the tables may have multiple roots.

It may happen in certain cases that a group of related data tables exhibit circular links (for example, the client table may have a link column to the accident data table to indicate the first, or last, accident undergone by each client). In such a case, the tree organization of the data tables is first restored by canceling one link of the circle. Which link should be cancelled is dictated by the semantics of the database (in the above example, the link from the client table to the accident table will naturally be cancelled).

Paths are defined in the data table tree from the root table(s) to the leaf tables. Each path from a root table to a leaf table is defined by a link column of the root table pointing to the leaf table, or by a succession of link columns via one or several intermediate tables.

In FIG. 4, two leaf tables have been added (dashed lines) to show a tree structure with multiple branching (the simplified example of FIGS. 1-3 provides a tree with a single path shown with a solid line). The added leaf tables are a third party table as mentioned previously and a broker table which is a target table from the policy table, to contain data about the brokers who commercialize the policies.

The data table records that are linked together can be viewed in a similar tree representation (FIGS. 5-7). The record tree of FIG. 5 shows that the accident #6 was related to policy #0 (car) subscribed by client #2 (Ariane) through broker #Y and involved third party #X. The solid lines represent respective links from the data tables of FIGS. 2 and 3.

The record tree of FIG. 6 further shows a Null record which may added in the accident table with a link to row-ID=2 in the policy table, for the reason that, as apparent from the last column of FIG. 3, no accident has occurred under policy #2 (subscribed by client #4 (Max) for his house).

A Null, or dummy, record stands for the absence of data. All its attribute values are default values (Null), which means "no value". The purpose of inserting such dummy records in the present scheme is to make sure that any valid record in any data table belongs to at least one record tree stemming from a record of a root table (FIG. 4).

A Null record may also be present in each data table which is a target table for at least one link column of a source table. When a row of the source table has no foreign key in the corresponding link column, the record tree(s) including that row is (are) completed with a Null at the location of said target table. This situation occurs for the broker table in the example illustrated in FIG. 6. To represent this, a default value (e.g. -1) can be written in the link column of the source table, whereby the Null record is implicitly present in the target table.

The Null records are inserted where appropriate in a process of scanning every single path in the data table tree from the leaf table of said path to the root table, i.e. downwardly in FIG. 4. When examining one source/target table pair in the scanning of a path, the target table row-ID values that do not occur in the relevant link column of the source table are first listed, and then for each missing row-ID value of the list, a new Null record is generated in the source table with said missing row-ID value in said link column.

If a Null record is thus inserted in a data table having several link columns, the Null record receives the default value (-1) in any link column other that the one pertaining to the path being scanned, to indicate that the corresponding link is to a Null record in the target table. This situation occurs for the third party table in the example illustrated in FIG. 6.

Scanning the data table tree from the leaves to the root is important. Otherwise, Null records containing links to other Null records in a target table might be overlooked. An example is shown in FIG. 7 which shows a record tree relating to client #0 (Oscar) who has no (more) policy: the accident table contains a Null record pointing to another Null record of the policy table which, in turn, points to client #0; the root of the record tree would not be in the root (accident) table if the paths were scanned upwardly.

In a conventional database organization as shown in FIGS. 1-3, the link keys are provided to optimize the memory usage. To illustrate this, reference may be made to the flat file shown in FIG. 8, which has exactly the same informational content as the three data tables of FIGS. 1-3 (the third party and broker tables are ignored in the sequel).

A flat file has a column for each one of the attributes (columns) of the data tables. For each complete record tree that can be constructed with the data table tree structure of FIG. 4, the flat file has a row which contains, in the relevant columns, the attribute values of all the records of said tree. The rows of the flat file are referred to herein as data graphs. Each data graph is identified by a flat file row-ID shown in the left-hand portion of FIG. 8. The record trees of FIGS. 5-7 are compact representations of the data graphs at row-ID's 6, 9 and 11, respectively.

Although the flat file representation is sometimes referred to the literature, it is of little practical interest for databases of significant size. The reason is that it requires excessive redundancy in the data storage.

For example, in our small-sized case, Andre's birth year and gender, as well as the details of his car policy are written three times in the flat file (row-ID's 0, 3 and 8), whereas they are written only once, along with link values, when the storage is in the form of data tables as in FIGS. 1-3. With databases of realistic size, such redundancy is not acceptable.

The database system according to the invention makes use of the flat file concept. However, it does not require the storage of the flat file as shown in FIG. 8, hence the concept of "virtual flat file" containing "virtual data graphs" (VDG). The term "virtual" refers to the fact that the flat file or data graphs need not be maintained explicitly in memory, although their data structure is used as a reference in the execution of the method.

In a particular embodiment of the invention, the flat file is reduced to a link table as shown in FIG. 9. Each row of the link table corresponds to a respective row of the flat file, i.e. to a record tree as shown in FIGS. 5-7.

The columns of the link table respectively correspond to the data tables of FIGS. 1-3. In other words, each column of the link table is associated with an attribute sub-group which is the sub-group of attributes allocated to the corresponding (target) data table. Each column of the link table contains link values (row-ID's) designating records of the corresponding target data table.

The row of the link table corresponding to a given data graph contains a default value (-1) in the column corresponding to any data table having a Null record in the record tree representing said data graph.

The data table row-ID's found in one row of the link table enable the retrieval of linked data from the data table, i.e. a data graph or part of it. All the links are represented in the link table. If one replaces the row-ID's stored in the columns of the link table of FIG. 9 by the attribute values stored in the identified rows of the respective data tables of FIGS. 1-3, one recovers the flat file of FIG. 8.

The proposed system further uses word thesauruses (FIGS. 10A-G) each associated with a respective column of one of the data tables, i.e. with one of the attributes.

In a preferred embodiment, there is one word thesaurus for each attribute used in the database system. However, if some attributes are known to be never or almost never used in the query criteria, then it is possible to dispense with the thesaurus for such attribute.

Each word thesaurus associated with one column of a data table has an entry for each attribute value found in that column. Such attribute value is referred to herein as a "word". A word has one entry in a thesaurus, and only one, as soon as it occurs at least once in the associated data table column. The Null value is a valid word in the thesaurus.

The entries of each thesaurus are sorted on the basis of the attribute values. An order relationship is therefore defined for each attribute category. This requires attention when the attribute value fields of the thesaurus files are defined and dimensioned.

Typically, the words are in the ASCII format and their category is selected for each column among the categories "integer", "real" and "character string". Character strings are sorted according to the usual lexicographical order. A date field is preferably declared as a character string such as yyyy (mm) (dd) (FIGS. 10B, 10E and 10F), yyyy representing the year, mm the month (optionally) and dd the day in the month (optionally). The thesaurus sorting thus puts any dates in the chronological order. If the attribute category is "integer", the numbers are aligned on the right-hand digit, in order to provide the natural order relationship among the integer data values. If the attribute category is "real", the numbers are aligned according to their whole parts, with as many digits on the right as in the value having the longest decimal part in the column.

The Null value is at one end (e.g. at the beginning) of each sorted thesaurus.

Each entry E(W) for a word W in a thesaurus associated with a column C(T) of a data table T contains information for identifying every row of the flat file which has the attribute value W in the column corresponding to C(T). When the flat file is stored virtually in the form of a link table, the information contained in entry E(W) is used for identifying every row of the link table which, in the column corresponding to the data table T, has a link pointing to a row having the value W in column C(T).

In other words, with the contents of the entry E(W) in the thesaurus associated with column C(T), we can retrieve all the data graphs whose corresponding attribute has the value W.

Such contents represent a row-ID list pointing to rows of the (virtual) flat file, i.e. a data graph identifier list. Such list may be empty, in particular for the Null value in some of the thesauruses (as in FIGS. 10A-C).

Two alternative representations of the data graph identifier lists in the thesauruses are illustrated in FIGS. 10A-G for the seven attribute columns of FIGS. 1-3. The first one is the form of explicit integer lists.

The second (equivalent) representation is in the form of bitmap vectors whose length is equal to (or greater than) the number of rows in the virtual flat file, i.e. the number of data graphs. The bit of position i in a bitmap vector (i.gtoreq.0) indicates whether the integer i belongs (1) or not (0) to the row-ID list represented by the bitmap vector. In our simplified example, the flat file has 12 rows so that the bitmap vectors may be of dimension 12.

The above-described data structure, comprising a virtual flat file and sorted thesaurus files pointing to rows of the virtual flat file is referred to herein as a VDG structure.

The VDG structure provides a powerful tool for efficiently processing queries in the database.

The virtual flat file is a reference table which defines a unified algebraic framework for the entries of all the thesauruses. The query criteria are examined with reference to the relevant thesauruses to obtain a flat file row-ID list (or bitmap vector) which represents all data graphs matching the query criteria, if any. The results can then be delivered by accessing the link table rows pointed to in that row-ID list to read the links which appear in part or all of the columns in order to retrieve attributes values as desired for the result presentation.

The processing with reference to the thesauruses mainly consists in logical operations performed on the row-ID lists to which they point. If they are represented as integer lists, such operations can be reduced to basic merge, intersect and/or complement operations, which respectively correspond to Boolean OR, AND, NOT operations in the bitmap representation.

The VDG structure also provides an efficient tool for accessing the contents of the database, which does not require accesses to the data tables. This tool is well suited to queries hav