Database system with methodology for storing a database table by vertically partitioning all columns of the table5794229Abstract A Client/Server Database System with improved methods for performing database queries, particularly DSS-type queries, is described. The system includes one or more Clients (e.g., Terminals or PCs) connected via a Network to a Server. In general operation, Clients store data in and retrieve data from one or more database tables resident on the Server by submitting SQL commands, some of which specify "queries"--criteria for selecting particular records of a table. The system implements methods for storing data vertically (i.e., by column), instead of horizontally (i.e., by row) as is traditionally done. Each column comprises a plurality of "cells" (i.e., column value for a record), which are arranged on a data page in a contiguous fashion. By storing data in a column-wise basis, the system can process a DSS query by bringing in only those columns of data which are of interest. Instead of retrieving row-based data pages consisting of information which is largely not of interest to a query, column-based pages can be retrieved consisting of information which is mostly, if not completely, of interest to the query. The retrieval itself can be done using more-efficient large block I/O transfers. The system includes data compression which is provided at the level of Cache or Buffer Managers, thus providing on-the-fly data compression in a manner which is transparent to each object. Since vertical storage of data leads to high repetition on a given data page, the system provides improved compression/decompression. Claims What is claimed is: Description COPYRIGHT NOTICE
______________________________________
SELECT Age, Gender, SUM(Revenue), COUNT(*)
FROM Customers
WHERE State IN ('MA', 'NY', 'RI', 'CT')
AND Status = 'Active'
GROUP BY Age, Gender;
SELECT State, Product, SUM(Revenue), AVG(Price)
FROM Customers
WHERE Product <> 'b'
AND Code = 1
GROUP BY State, Product
______________________________________
In such a query--one which is typical for DSS--storing data on a column basis by cell is far more preferable than storing data in the traditional row or record format. The foregoing query entails looking at 50% or greater of the data and includes GROUP BY clauses (e.g., on State or Gender) and includes SUM, COUNT, and AVG clauses (e.g., average on Revenue). Processing such a query using traditional storage methodology is highly inefficient, as row-based (horizontal) data pages bring in from disk all columns of data, whether or not particular columns are of interest to the query. By storing data in a column-wise basis, the system of the present invention can process the foregoing query by bringing in only those columns of data which are of interest. Further, storing data in a column-wise basis leads to high repetition of data and, thus, far better compression than can be realized with row-based storage. Moreover, it has been observed that the vast majority of information in real-world DSS applications is low cardinality data; for example, a State field has only 50 unique values. The majority (e.g., 80% or more) of the columns of a typical DSS table will store low cardinality data. As a result, repetition within a particular column is quite high. The nature of the data encountered, therefore, further enhances the advantages of column-wise storage or vertical partitioning. Referring now to FIGS. 3A-C, vertical partitioning or column-based storage in accordance with the present invention will now be diagrammatically illustrated. FIG. 3A shows Customers table as a logical view--that is, a table view in familiar column/row format. As shown, the table 300 includes fields of Customer (Cust) ID, Name, Street, City, State, Zip, Age, Gender, and so forth and so on. For tables employed for DSS purposes, it is not uncommon for tables to include 50 or more fields. Moreover, a large number of rows is typical for tables employed for DSS purposes, often exceeding 1 million rows. As shown in FIG. 3A, row-based (horizontal) data storage of the table 300 entails storing each row in a data page, such as the data page 320. In particular, the data page 320 comprises a page header (e.g., storing housekeeping information) and one or more rows from the table 300. As shown in particular detail, the data page 320 lays down each row in its data page, one after another. Within each of the rows stored in the data page is the collection of particular data values for that row (i.e., record). The data page 320 is conventionally connected to other data pages, such as data page 310 (left neighbor) and data page 330 (right neighbor). In this manner, the pages form a single page chain 350. Actual connection between the pages is typically achieved using conventional page pointers, such as the forward-referencing page pointers shown for page 310 (i.e., which points to page 320) and for page 320 (i.e., which points to page 330). Modification of data storage, in accordance with the present invention, is shown in particular detail in FIG. 3B. Instead of storing the data members of each row together on a data page, the data members for each column are stored together as "cells" on a data page. As shown in FIG. 3A for the table 300, for instance, the Customer (Cust) ID column values are stored together on data page 370. Like the row-based data page (e.g., data page 320), the column-based data page 370 includes a page header (e.g., for housekeeping information) and a page pointer. The data page 370 is connected to other data pages which store Customer (Cust) ID column values, such as data page 360 and data page 380, for forming page chain 355, as shown. As before, each data page may include a page pointer for referencing the next data page in the page chain. In a similar manner, other columns may be stored. As shown in FIG. 3C, for instance, the State field of table 300 is stored in a page chain 356, with each page in the chain storing State data values in cells. As shown in the figure, for instance, each particular State data value (i.e., one column value for each row in logical view 300) is stored in a cell of data page 371. The data page 371 may be connected or linked to other data pages (in a manner similar to that previously shown), such as data page 361 and data page 381 for forming the page chain 356. Both the data pages of FIG. 3B and of FIG. 3C are shown in uncompressed format. For data which is suitable for compression, however, the data would be compressed before stored on data pages, as described in further detail hereinbelow. Referring now to FIGS. 4A-D, a first level of data compression--natural data reduction--in accordance with the present invention will now be described for vertically stored data. Before conventional compression methodology is applied to the column data values, the present invention recognizes that, given the new format for storing data (i.e., vertically), many data types may be subjected to a "natural data reduction." Consider the data values from the Customer (Cust) ID field, as shown in FIG. 4A. Typically, these whole number or integer values would be represented in most digital computer implementations as 32-bit (or more) integers. A 32-bit integer has the capability of representing a Customer ID up to about four billion. The user's actual data, however, never approaches this limit. As a result, a large number of the bits employed for representing Customer ID data values are simply unused bits, as shown in FIG. 4A. In accordance with the present invention, therefore, these unused bits may be dropped for achieving a first level of data compression. For storage of customer ID, for instance, the first 20 bits may be dropped from each data value upon storage in the data page. This is shown in particular detail for the data page 410 of FIG. 4A. When the data value is restored from the data page (i.e., expanded from a cell on the data page to a data value in memory), the data value can be re-expanded to its native size (e.g., 32-bit integer), so that it can participate as a data member of the appropriate size (i.e., for its data type), thus keeping the natural data reduction compression transparent when the data is used by the client. As shown in FIG. 4B, the data which has undergone a first-level compression (i.e., natural data reduction) can now undergo a second-level compression, using known compression methodology. As shown in FIG. 4B, for instance, the data page 410 may undergo further compression to a data page 420a with leading key (prefix) compression, a data page 420b with LZW compression, a data page 420c with LZS (LZRW1) compression, and/or a data page 420d with Run-Length Encoding (RLE) compression. When the particular data page is restored from disk (i.e., loaded by a Buffer Manager into memory), the respective decompression methodology would be employed, followed by restoring the unused bits (in the event that natural data reduction compression is also employed). FIG. 4C illustrates that other columns of the table 300 are suitable for natural data reduction. For the Age field or column, for instance, only the lower 7 or 8 bits of a 32-bit integer would be employed, as customers are generally under the age of 128 (and certainly under the age of 256). FIG. 4D illustrates that, in addition to natural data reduction of unused bits, further data reduction can be realized by the clustering of values for the data values of a particular column. Consider the data values for a Price column or field, for instance. Here, the values are not entirely random. It would be unlikely, for example, to have a price of $34.23. Instead, values tend to cluster around certain points, such as $99.95, $99.99, $100.00. As a result, the bits employed for representing a domain of values can be reduced. Moreover, this clustering brought on by vertical storage serves to "pre-condition" the data for enhancing the compression realized by conventional methodology. In particular, vertical storage enhances the ability of conventional compression methodology to reduce redundancies in data, thus leading to enhanced compression. FIG. 5 illustrates that, in a preferred embodiment, compression is applied on a per page basis. Consider, for instance, table (view) 501, which includes four columns. In accordance with the present invention, the table is stored as a plurality of vertical page chains 503, as shown. Here, compression is provided on a per page basis, with the particular compression type for a given page being indicated by a CType data member in each page header. The actual page type for each page is indicated by the BType data member, which is also stored in the page header. For the page 511, for instance, the page type is bit map (BType=bit map) and the compression type is Run-Length Encoding (CType=RLE). For page 521, on the other hand, page type is data (e.g., alphanumeric user data) and compression type is LZRW1. Within the very same vertical page chain, however, compression type can be changed, since it is provided on a per page basis. Thus as shown by page 522, compression type is switched to LZW. 3. Logical tables Traditional database systems maintain a system catalog for keeping track of tables, at a physical level (i.e., physical tables). For a particular table, such a system can find from the catalog a chain of pages (i.e., physical chain of pages) which represent that table. The system of the present invention adopts a similar approach, but with an important difference. In the system of the present invention, from an SQL system catalog, the system can determine a chain of columns which represent a particular table. From the chain of columns, it can locate the chain of pages for those columns. Effectively, through the same conventional mechanism as employed for tracking a traditional table, the system of the present invention is also able to track its tables, but with the important difference that the catalog is organized to optimize for DSS (i.e., by implementing vertical partitioning). Thus at a core level--at the level of the system catalog--the organization is changed to emphasize the notion of columns, instead of tables. Here, the concept of a "table" in the system of the present invention is purely a catalog logical entry, as opposed to a physical entity in which it traditionally exists. The columns are "tied together" logically to represent a table. Each cell within a column (i.e., column value for a record) is arranged on a data page in a contiguous fashion. For fixed-length fields (e.g., two character State field), the offset of any particular cell on a page can be quickly computed as a modulus of that cell (adjusted for any header information to the page). Between the individual cells, there is no overhead information, in contrast to several row-based implementations. Since the cells, in essence, exist as a solid stream of data, column scans are particularly efficient. The pages are optimized for compression by storing in the page header a status flag indicating whether the data page is a candidate for compression and (optionally) what type of compression is best suited for the data on that page. Since this is settable on a page-by-page basis, one particular compression technique can be applied on one column yet at the same time another compression technique applied on another column, all within the same (logical) table. In addition to sequentially scanning the pages, the system also allows random access. In particular, a B-Tree data structure is maintained; the key stored in the B-Tree is the logical page number (e.g., block 1, block 2, and so forth and so on). The logical block number translates into a physical page number. For a page chain comprising 4 million pages, therefore, the system can access a particular page (e.g., the 1 millionth page) without having to scan all pages. This is useful for supporting more traditional, "needle-in-a-haystack" queries. The column-wise approach to storing table information improves ALTER table operations, such as for adding or deleting columns. In a traditional system, the task of adding or deleting columns is quite expensive, as this requires the system to touch every row of every data page, for removing or adding the particular row. In the system of the present invention, in contrast, a column is added simply as a new entry is added to the system catalog. In a similar fashion, a column is dropped by updating the system catalog and, additionally, deleting the data pages for that particular row. With vertical partitioning, therefore, table ALTER operations change from being very slow to being very quick. This is particularly important in the context of DSS applications as data warehouse data typically has tables of numerous columns of historical data and, hence, tends to undergo much more schema changes over time. 4. Buffer Manager modified to provide native data compression Data compression is added to the system at the level of the Cache or Buffer Managers. It is preferable to isolate compression here so that each object need not be concerned about compressing itself (or even being aware that compression is occurring). As a result, compression is transparently added to all data objects managed by Buffer Managers. The data pages of an object are compressed when sent out to disk and decompressed when retrieved from disk, yet the object itself is unaware of this action. Most objects within the system, such as tables, indexes, logs, and the like, exist as pages. As these objects are streamed to disk, each simply requests its Buffer Manager to store the object on disk. The Manager in turn stores the object on disk using the best compression methodology known to it, for the given object. In this manner, data compression is transparent to the data objects above the level of the Buffer Manager. Actual compression methodology can be provided using commercially available compression/decompression libraries. In a preferred embodiment, the methodology employed is LZS221 (available from Stac Electronics of San Diego, Calif.), which is described in U.S. Pat. Nos. 4,701,745, 5,016,009, 5,126,739, and 5,146,221. LZRW1, a similar methodology, can be employed. See Williams, Ross, An Extremely Fast Data Compression Algorithm, Proceedings from IEEE Computer Society Data Compression Conference, Snowbird, Utah, April 1991; additional material available from Internet FTP site FTP.adelaide.edu.au ("compression" subdirectory). Alternatively, LZW compression/decompression, which is described in U.S. Pat. No. 4,558,302, may be employed; or PKZIP compression/decompression (available from PKWare of Brown Deer, Wis.), which is described in U.S. Pat. No. 5,051,745, may be employed. The disclosures of each of the foregoing are hereby incorporated by reference. With column-formatted storage, as taught by the present invention, data is preferably compressed/decompress using the LZS (or related LZRW1) methodology. The LZS approach exhibits good compression/decompression throughput. More particularly, better-than average decompression times are realized with column-formatted data storage, as indicated by the following benchmarks.
TABLE 1
__________________________________________________________________________
LZW
LZRW1
RLE
GZIP1
GZIP3
GZIP6
PKZIP
__________________________________________________________________________
.sub.-- BArrayCompressedLength (MB)
395
488 481
410 402 385 427
.sub.-- BArrayCompressTime (s.)
1741
236 564
1982
2617
5973
11631
.sub.-- BArrayDataLength (MB)
969
1055
588
1158
1158
1158
1110
.sub.-- BArrayDecompressTime (s.)
597
157 75 481 444 411 1812
Compression speed (KB/s)
557
4343
1043
585 442 194 95
Decompression speed (KB/s)
1623
6720
7840
2407
2608
2818
613
.sub.-- BMLeafCompressedLength (MB)
59 78 44 139 140 137 57
.sub.-- BMLeafCompressTime (s.)
300
40 62 518 644 1185
1746
.sub.-- BMLeafDataLength (MB)
188
188 150
284 284 284 194
.sub.-- BMLeafDecompressTime (s.)
115
28 17 100 98 96 251
.sub.-- BMSegCompressedLength (MB)
0.6
6.6 0.07
0.3 0.3 0.1 0.4
.sub.-- BMsegCompressTime (s.)
63 5 2 27 24 53 55
.sub.-- BMSegDataLength (MB)
50 50 50 50 50 50 50
.sub.-- BMSegDecompressTime (s.)
27 6 5 3 4 6 11
.sub.-- BTreeCompressedLength (MB)
31 35 0.9
20 22 20 23
.sub.-- BTreeCompressTime (s.)
153
17 1.4
117 178 523 2266
.sub.-- BTreeDataLength (MB)
86 86 8.7
86 86 86 86
.sub.-- BTreeDecompressTime (s.)
55 15 0.7
31 29 28 105
__________________________________________________________________________
Full TPC-D database
LZW LZRW1
__________________________________________________________________________
Compress 1:18:51
54:58
Blocks 121302 120452
Decompress 4:47 2:52 (2 columns of lineitem table)
__________________________________________________________________________
Decompression time is particularly important in DSS environments as there are usually more readers than writers. Internal operation A. Buffer Manager 1. Overview of s.sub.-- bufman and s.sub.-- buf The construction and internal operation of the present invention is perhaps best understood by examining the C++ objects/classes which provide the functionality necessary for implementing the above-described improved Decision Support query performance. Central to operation of the system of the present invention is an improved Cache or Buffer Manager, which in the system of the present invention is implemented as a plurality of Buffer Managers, each comprising a Buffer Manager and one or more buffers. A first object, created from a s.sub.-- bufman C++ class, is an object which serves as the Cache or Buffer Manager proper. In a preferred embodiment, multiple instances of s.sub.-- bufman objects can be created. Each is a complete Buffer Manager in its own right. When an object instance of class s.sub.-- bufman is created, the object receives a list of one or more files for which it is to manage a cache buffer for. The files themselves may include logical files (e.g., UNIX files), raw partitions, or the like. The s.sub.-- bufman object functions both as a Buffer Manager and as an interface to a page allocator (i.e., interface to disk storage). When starting with a completely empty database, an object of type s.sub.-- bufman is instantiated by the system by invoking a "Create" method of the s.sub.-- bufman object, s.sub.-- bufman::Create. This call, in turn, returns a pointer to an object of class s.sub.-- buf--a buffer or cache. An s.sub.-- bufman object contains or controls a cache of s.sub.-- buf. During operation, in other words, s.sub.-- bufman includes a Create method which creates an instance of s.sub.-- buf, a buffer in memory. Two parameters are specified at creation of an s.sub.-- buf cache object: the physical block size (i.e., the minimum size of a physical I/O operation) and how many physical blocks equal a "page." The former indicates how big the page size is. In an exemplary embodiment, page sizes are multiples of the physical block size (that the Buffer Manager has been told that it can use); in an exemplary embodiment, the default size is 4K. The default size for a page, in turn, is 16 blocks. The page size is, therefore, 64K (i.e., 16 times 4K). Although the system can create buffers of different page sizes, preferably all have the same fundamental block size which each is based on. When a buffer is created in memory, it is not necessarily streamed out to disk. Therefore, when the Create method is called, the system does not at that point find a physical location on disk to store the object. Only upon occurrence of an explicit write call (e.g., when "paging out") does the system invoke the page allocator for putting the information in one of the files which the object serves as the Buffer Manager for. The Manager automatically determines when the object is stored to disk. When written to disk, if the object has not been allocated a slot on disk (i.e., a physical address or page number on disk), the Manager will (in conjunction with a page manager) allocate one. The only way a user or client can get a buffer is through the Create (or Find) method described below. The Create method returns an s.sub.-- buf pointer, in essence giving the user or client access to a buffer which is in memory. The buffer is returned pinned or locked (so that the pointer remains valid). The Create and Find calls automatically perform the task of pinning or locking the buffer in memory. Later, when the buffer is freed, it will be unlocked or unpinned. For this purpose, the s.sub.-- buf class includes an unlock method. To actually find something in the buffer, a Find method of the Buffer Manager is invoked. Specifically, the system invokes an s.sub.-- bufman::Find method, passing it a page number or ID for the page being sought. One of two things can happen, a cache "hit" or cache "miss" occurs. Internally, the Find method looks at its own cache to determine whether a cache hit or miss occurs. From its own hash table, the Buffer Manager (s.sub.-- bufman) can determine cache hit or miss based on the page number. In the instance of a cache miss, for example, the Buffer Manager finds no entry in its own internal hash table for the sought-after page; accordingly, it does not have the item in cache. The manager must go to disk to "grab" or read the page. The actual call to read a page when a miss occurs invokes s.sub.-- buf::Read, an internal routine or method of s.sub.-- buf. It is at this subroutine that the system invokes additional subroutines for decompressing the sought-after page. Actual implementation of compression and decompression is performed by a separate module which includes Compress and Decompress methods. In a preferred embodiment, the compressed version of the page is retrieved into a separate buffer (i.e., separate from s.sub.-- buf). The system then decompresses the sought-after page from the separate buffer into the s.sub.-- buf buffer. The reason for using the two-buffer approach for decompression is as follows. In a preferred embodiment, a "pre-imaging" strategy is employed for pre-imaging things which are already compressed. If a client does a "read" operation followed by a "dirty" soon thereafter, the system need not recompress that particular page. The system maintains a cache of the most recently compressed version. When the system pre-images the page for transaction image processing, the system need not perform the I/O operation again (i.e., of the original, compressed data), nor need it decompress the compressed data. In the preferred embodiment, a "dirty" mechanism is called before a change is made. This is perhaps best explained by examining the update path--how things are written out--in the system. In the instance of a cache hit, the task of the Buffer Manager is simple. In a cache hit for a Find operation, the Buffer Manager does a hash to the page, which is resident in the buffer in an uncompressed form. The Buffer Manager can, therefore, return an s.sub.-- buf pointer to that page. Although a user (client) of the Buffer Manager can "read" particular pages, the Buffer Manager does not surface a "Write" method to its users. Instead, the Buffer Manager assumes responsibility for writing out pages at appropriate times. For instance, when insufficient room exists in the cache (e.g., when bringing in a new page), the Buffer Manager will "paged out" pages automatically, according to a Least-Recently Used (LRU) or other aging scheme. Additionally, at "commit" time (i.e., when a transaction commits), the Buffer Manager schedules all the writes which have not been performed yet. In essence, therefore, the user simply modifies (i.e., "dirties") pages, and the Buffer Manager assumes responsibility for the actual task of writing pages to disk. In this manner, the Buffer Manager provides automatic writing of pages, without publishing a "Write" method to its users. By specifying a "commit," however, a client or user can inform the Buffer Manager that the buffer has been changed and that the user is done. In essence, this informs the Buffer Manager that it can now flush the buffer (e.g., in preparation for the next transaction). In the case where the user desires to change data, it obtains an s.sub.-- buf pointer, by invoking the Create or Find methods. The user data itself resides in the buffer. Before the user actually modifies that data, it first invokes a Dirty method. From the s.sub.-- buf object, the user receives a pointer to a space which it is allowed to modify (usually a subset of a 64K space). By invoking s.sub.-- buf::Dirty, the user can gain the privilege to modify the data. The Dirty method performs the "pre-imaging"--creating a before image (i.e., compressed version, before change), for crash recovery. Additionally, the Dirty method toggles a flag in the s.sub.-- buf buffer, for indicating that the buffer is now "dirty"; therefore, if it is "paged out," it needs to be written. When pre-imaging, the system looks for a hit in the cache of already compressed pages. If one is not found (i.e., "miss"), then the system will perform the necessary I/O operation to get the compressed version (and cache it as a pre-image). Actual writing of a buffer is done at the level of s.sub.-- buf, with each buffer (i.e., s.sub.-- buf object) including an internal Write method for writing itself out, the class method s.sub.-- buf::Write. This method is invoked for "flush," "commit," and "page out" operations. Like the Read method, the Write method invokes subroutine calls to the low-level compression routines or methods. After the system is done with a buffer, it calls a Destroy method. In a preferred embodiment, more than one version of the Destroy method is provided. In a first method of Destroy, a s.sub.-- bufman::Destroy method can be invoked with the s.sub.-- buf pointer. This method frees up the buffer and gives back the page which that buffer belongs to, to the page allocator or manager (i.e., free list manager). In the second version of the destroy method, a page ID or number is passed. This version otherwise performs the same functionality as the first version. 2. Detailed construction and operation of s.sub.-- bufman a. Class definition In an exemplary embodiment, the s.sub.-- bufman class may be constructed as follows (using the C++ programming language).
__________________________________________________________________________
1:
/*******************************************************************
2:
s.sub.-- bufman -- the main interface to the outside world
3:
*******************************************************************/
4:
5:
class s.sub.-- bufman : public hos.sub.-- error, public s.sub.--
bufman.sub.-- stats
6:
{
7:
8:
friend class s.sub.-- buf;
9:
friend class s.sub.-- bufcache ;
10:
friend class s.sub.-- blockmap ;
11:
friend class s.sub.-- blockmap.sub.-- identity ;
12:
friend class s.sub.-- bufpool.sub.-- ts ;
13:
14:
// . . .
15:
16:
public: // s.sub.-- bufman interface
17:
18: s.sub.-- bufman(hos.sub.-- int maxbuffers, hos.sub.-- int
blocksize,
19: hos.sub.-- boolean rwaccess, hos.sub.-- memloc in.sub.-- shrmem=HOS
.sub.-- PRVMEM);
20: .about.s.sub.-- bufman( );
21: s.sub.-- bufman(s.sub.-- bufman &);
22:
23: // . . .
24:
25: s.sub.-- buf*
Create(hos.sub.-- bio*, s.sub.-- btype, hos.sub.-- int
nBlks,
26: hos.sub.-- cmprstype = HOS.sub.-- CMPRST.sub.-- ANY) ;
27: s.sub.-- buf*
Create(s.sub.-- blockmap*, hos.sub.-- bio*, s.sub.-- btype,
28: hos.sub.-- cmprstype = HOS.sub.-- CMPRST.sub.-- ANY);
29: s.sub.-- blockmap*
CreateBlockMap(hos.sub.-- bio*, s.sub.-- blockmap.sub.--
identity&,
30: hos.sub.-- int minBlocks,
31: hos.sub.-- int chunkSize, hos.sub.-- int maxSlots) ;
32:
33: void
Destroy(hos.sub.-- bio *biop, hos.sub.-- uint block, s.sub.--
btype btype,
34: hos.sub.-- int nBlks) ;
35: void
Destroy(s.sub.-- bufPtr& sb);
36: void
Destroy(s.sub.-- buf *sb); // use only for a member variable
37:
38: s.sub.-- buf*
Find(hos.sub.-- bio*, hos.sub.-- uint block, s.sub.-- btype
btype,
39: hos.sub.-- int nLogicalBlks) ;
40: s.sub.-- blockmap* FindBlockMap(hos.sub.-- bio*, const s.sub.--
blockmap.sub.-- identity&,
41: hos.sub.-- boolean RWAccess) ;
42:
43: // Write all dirty buffers.
44: void
Flush(hos.sub.-- boolean unlockedOnly = HOS.sub.-- FALSE);
45:
46: hos.sub.-- boolean FreeUpMemorySpace( );
47: hos.sub.-- boolean FreeUpDiskSpace(hos.sub.-- bio* requestingBio,
48: hos.sub.-- int nBLocks);
49: void
PrepareToRelease( ) ;
50: void
Release(hos.sub.-- boolean force,
51: hos.sub.-- boolean finalCommitRelease = HOS.sub.-- FALSE) ;
52:
53:
private:
54: // . . .
55: void
DeleteBlockMap(s.sub.-- blockmap*, hos.sub.-- mutexCatchLock*,
56: hos.sub.-- mutexCatchLock* = 0) ;
57: void
Destroy(s.sub.-- blockmap*,hos.sub.-- bio*,hos.sub.-- uint
block,s.sub.-- btype) ;
58: s.sub.-- buf*
Find(s.sub.-- blockmap*, hos.sub.-- bio*,
59: hos.sub.-- uint logicalBlockId, s.sub.-- btype);
60: void
FindAndDestroy(hos.sub.-- bio*, hos.sub.-- uint
physicalBlockId,
61: s.sub.-- btype, hos.sub.-- int nLogicalBlocks);
62: void
FindAndDestroy(s.sub.-- blockmap*, hos.sub.-- bio*,
63: hos.sub.-- uint logicalBlockId, s.sub.-- btype) ;
64: // . . .
65:
66:
67:
// ---------------------
68:
69: static const hos.sub.-- cmprstype
70: .sub.-- btypeSpecificCompressions ›NUMBER.sub.-- S.sub.-- BTYPES!
;
71: hos.sub.-- int
.sub.-- blocksize;
// The blocksize of all blocks in
72: // the buffer pool.
73: hos.sub.-- boolean
.sub.-- throwing ;
// True if throwing and bufman
74: // is in inconsistent state
75: // Set/checked by mutex lock objects
76: hos.sub.-- boolean
.sub.-- allowQuickDrop ;
// True if allowing quick drop
77:
78: hos.sub.-- boolean
.sub.-- rwaccess;
// Access HOS.sub.-- TRUE for ReadWrite
79: // HOS.sub.-- FALSE for ReadOnly.
80: hos.sub.-- int
.sub.-- flags ;
// S.sub.-- BUFMAN.sub.-- XXX flags . . .
81: hos.sub.-- int
.sub.-- maxbuffers;
// Slots in buffer table.
82: hos.sub.-- int
.sub.-- nBuffersInUse;
// Current number of slots in use.
83:
84: // .sub.-- nBuffersLocked subject to mutex (non-HPRODUCTION
systems)
85: hos.sub.-- mutex
.sub.-- BuffersLockedMutex ; // controls .sub.-- nBuffersLocked
3
86: hos.sub.-- int
.sub.-- nBuffersLocked; // Current number of slots in use.
87: hos.sub.-- uint
.sub.-- tbuffers;
// Total number of slots ever used.
88:
89: // Use of .sub.-- blockmapBufman doesn't require the
90: // following mutex, but instantiation does.
91: hos.sub.-- mutex
.sub.-- blockmapBufmanMutex;
// mutex for instantiation
92: // of blockmapBufman
93:
94: s.sub.-- bufman*
.sub.-- blockmapBufman;
// a bufman for blockmaps
95:
96: // The following pair of ivars must be manipulated as a unit
97: s.sub.-- hashtb*
.sub.-- hashtable;
// Hashtable to index buffers.
98: hos.sub.-- mutex
.sub.-- hashtableMutex;
// for locking `.sub.-- hashtable`
99:
100: // The following pair of ivars must be manipulated as a unit
101: s.sub.-- hashtb*
.sub.-- blockmaps ;
// blockmaps owned by bufman
102: hos.sub.-- mutex
.sub.-- blockmapsMutex;
// for locking `.sub.-- blockmaps`
103:
104: s.sub.-- bufpool.sub.-- ts
.sub.-- bufpool ;
// Owns/manages. s.sub.-- buf s in LRU chains
105: s.sub.-- bufcache
.sub.-- compressedBufferCache;
// cache results of Read( )
106:
107: hos.sub.-- memloc
.sub.-- inShrmem ;
// in shared memory
108:
};
__________________________________________________________________________
(line numbers added for convenience of description) As shown, the s.sub.-- bufman class is derived from two classes: hos.sub.-- error and s.sub.-- bufman.sub.-- state. The hos.sub.-- error class provides C++-like exception handling; since some C++ compiler implementations do not provide this functionality yet, the exception handling is provided at the source code level. The second parent class s.sub.-- bufman.sub.-- state, provides a base class for the Buffer Manager, which includes general state information (which is not necessary for understanding the present invention). The class definition itself begins with several "friend" class declarations. These are included for performance reasons, so that data members of friend classes can be accessed without employing "wrapper" methods. At line 18, the class definition includes the first public constructor. It is invoked with four parameters. The first parameter, maxbuffiers, specifies the maximum number of buffers for the Buffer Manager being created. This is usually a number on the order of thousands or tens of thousands. The second parameter, blocksize, specifies the block size for each buffer. The third parameter, rwaccess, specifies read/write access privileges for the buffers. Since the system supports read-only buffers, the parameters serve as a flag for indicating to the Buffer Manager whether the buffers are read-only buffers or read/write buffers. The last parameter, in.sub.-- shrmem, is a flag indicating whether the Buffer Manager is a shared memory Buffer Manager. For a shared memory Buffer Manager, memory allocation operations occur out of shared memory. Complementing the constructors, the class definition includes an s.sub.-- bufman destructor, at line 20. The destructor, which performs cleanup, can be implemented in a conventional manner. A final constructor--a copy constructor--is defined, at line 21. The copy constructor simply performs "copy construction"--a well-known C++ technique of creating a copy of an object. This may be implemented using conventional C++ techniques. At line 25, the class definition defines the Create method. At line 27, a second Create method is defined, using C++ overloading features. The two Create methods are essentially identical, except that the latter version employs a "block map." To understand this difference, therefore, it is necessary to describe what a "block map" is. One of the problems in doing compression is that a modified block may no longer compress back down to its original size. Consider a Create operation which yields a 64K buffer. Suppose that the first time the 64K buffer is paged out it compresses down to one 4K block. Here, the page allocator or manager allocates one block (on disk) for storing the block which is being paged out. Suppose, at a later point in time, that the block is brought back in and modified in such a way that it will no longer compress down to one block. The problem which arises, therefore, is that the page must now be relocated, as it no longer fits in its then-existing slot (which has members to the left and to the right). For hierarchical data objects (e.g., B-Tree), it is likely that other members (i.e., ones which store pointers to the block) may also need to be notified of the relocation. To address the foregoing problem, therefore, a block map is employed in the system of the present invention. Within a Buffer Manager, there can be as many instances of a block map as needed. Typically, one exists per object or portion thereof. For instance, one block map can exist per B-Tree, or one per portion of a B-Tree (e.g., non-leaf level pages). Each block map, in turn, may be thought of as a logical-to-physical translator for the object (or subobject) which is its focus. In this manner, the system can concentrate on bringing in the object (or portion of the object) which is needed. From a call to the Create method, a user or client may get back a particular page number, such as page #1. This page number is a logical page number which serves as an index into the corresponding block map for determining the actual physical page number. For implementations without compression, this translator may be eliminated. Returning to the description of the two Create methods, each takes a hos.sub.-- bio pointer parameter, which indicates the particular file which this buffer is being established for. The second Create method also takes the just-described block map, for performing translation in the case of systems which employ compression. The first version is employed, therefore, when the logical-to-physical translation is not required (e.g., for data objects which do not have a hierarchical relation, or when compression is turned off). Since the block map does create an extra level of indirection when doing page mapping and page out operations, it is preferably avoided, when possible. For this reason, the s.sub.-- bufman class definition includes the two versions of the Create method. The other two parameters to the Create method include nBlks and hos.sub.-- cmprstype. The former indicates how many physical database blocks make a page, for the buffer being created. The former parameter specifies the default compression type for the buffer. As shown, the default compression type is "any"--the system picks the compression type. However, the client is given the ability to override this type, by simply specifying the type when invoking the Create method. This is useful in instances where the client has better knowledge of which compression type to employ. The second Create method also includes an s.sub.-- b type parameter. This serves as a flag for indicating what page type the buffer is employed for. For example, page types may include B-Tree, bit map, and the like. This allows the system to perform additional error checking. In the header of each page on disk, a page type data member (byte) is stored, for indicating a particular page type. This may be employed, for instance, during a Find operation for making sure that the page type found is that which is expected. Next, the s.sub.-- bufman class definition defines a CreateBlockMap method. This method simply creates a block map, whose functionality has just been described. The translation from logical-to-physical may be performed using conventional virtual memory management techniques. Lines 33-36 define Destroy methods. Effectively, each Destroy method performs the same functionality: freeing up memory allocated for a buffer and, if corresponding space is allocated on disk, giving it back to the free list or page manager. The parameters for the Destroy method at line 33 correspond to those previously described for the Create methods. At line 38, the s.sub.-- bufman class defines the Find method. The Find method takes as its first parameter a hos.sub.-- bio pointer, which is a pointer to the file which it is desired to "find" a block in. Here, a hos.sub.-- bio file is a logical file. In an exemplary embodiment, as previously described, two versions of files exist. In a first version, a file can be a single file. Alternatively, a file can be a plurality of files simply treated as a single logical file. Here, "bio" stands for block input/output--a system which reads objects in increments of block size. The Find method is followed by a FindBlockMap method; it returns the location of where the block map begins for the particular object of the s.sub.-- bufman class. Next, the Flush method is defined. It simply schedules all dirty buffers to be written to disk, as previously mentioned. This is followed by other housekeeping class methods, including FreeUpMemorySpace, Free UpDiskSpace, and Release. The next section of the class definition is the private declarations. Generally, the private method or routines include corresponding versions of the public methods. These methods are called by respective public methods, but the private routines do not take out locks. This is done to decrease contention for locks, particularly since the methods can be invoked internally (i.e., with the class). In other words, the public methods largely serve as wrappers to the private methods. In operation, therefore, the public methods take out appropriate locks and call onto the private methods. The private methods may be called internally when the object knows that it already has secured the corresponding lock. The remaining private methods are internal housekeeping routines and are not considered relevant to the invention herein. Lines 67-107 of the class definition set forth data member declarations. Line 69 declares an array of type hos.sub.-- cmprstype, which serves for statistics monitoring. The next data member, .sub.-- blocksize, stores the block size of all blocks in the buffer, at line 71. The .sub.-- throwing data member, at line 73, is a flag indicating if the object is currently "throwing" an exception (i.e., the s.sub.-- bufman is in an inconsistent state). The .sub.-- allowQuickDrop data member, at line 76, serves as a flag indicating whether "quick drop" (i.e., no logging) is allowed. The .sub.-- rwaccess data member, at line 78, serves as a boolean indicating whether access is read/write or read-only. The .sub.-- flags data member stores housekeeping flags. The .sub.-- maxbuffers data member stores the number of slots in the buffer table; this number is passed in as a constructor argument. The .sub.-- nBuffersinUse data member, at line 82, stores the current number of slots in use. The .sub.-- nBuffersLockedMutex, at line 85, controls the current number of slots which are locked--in use. The number of locked buffers is stored by .sub.-- nBuffersLocked, at line 86. The total number of slots ever used is stored by the .sub.-- tbuffers data member, at line 87. This is for statistics monitoring, indicating the maximum number of slots ever used. At line 91, the class declares a .sub.-- blockmapBufmanMutex, for mutex (MUTually EXclusive) locking. Use of a mutex for multithreaded synchronization is know in the art. In a preferred embodiment, the block map includes its own s.sub.-- bufman Buffer Manager (i.e., a bufman within a bufman). Accordingly, it employs its own buffer pool. Therefore, at line 94, a bufman for block maps is declared, .sub.-- blockmapBufman. At line 97, the class declares a hash table to index buffers, .sub.-- hashtable. In an exemplary embodiment, as previously described, the system performs a hash on the page number. At line 98, the class declares a .sub.-- hashtableMutex data member for mutex (MUTually EXclusive) locking the hash table (e.g., for reading and writing entries to the table). In a corresponding manner, a hash table is declared for block maps at line 101, and a mutex is declared for locking the block map hash table at line 102. Completing the class definition, the .sub.-- bufpool data member, at line 104, manages the LRU (Least-Recently Used) chain, which guides buffer page reuse. The .sub.-- compressedBufferCache data member, at line 105, is the compressed pre-image buffer cache --that is, it caches the results of the read operation. All reads are performed into this buffer, from which the data can be decompressed into the corresponding s.sub.-- buf buffer. Finally, at line 107, the .sub.-- inShrmem data member serves as a flag indicating whether the s.sub.-- bufman object is in shared memory. b. Exemplary class methods (1) Create In an exemplary embodiment, the Create method can be constructed as follows (using the C++ programming language).
__________________________________________________________________________
1:
s.sub.-- buf *s.sub.-- bufman::Create(hos.sub.-- bio *biop,
2:
s.sub.-- btype btype, hos.sub.-- int nBlks, hos.sub.-- cmprstype
ctype)
3:
{ // Create
4:
s.sub.-- buf*
slot;
5:
hos.sub.-- uint blknu;
6:
7:
hos.sub.-- mutexCatchLock sbufMutexLock(.sub.-- throwing) ;
8:
// no mutex, set in bowels of bufman
9:
10:
// . . .
11:
validbio(biop);
12:
MustHaveRWAccess( );
13:
14:
// Modify the compression type if the argument is ambiguous
15:
// and we have a more specific choice for passed block type
16:
if (hos.sub.-- compress::CompressionTypeIsAmbiguous(ctype))
17: if (.sub.-- btypeSpecificCompressions›btype! |= HOS.sub.-- CMPRST.sub.
-- ANY)
18: ctype = .sub.-- btypeSpecificCompressions›btype! ;
19:
20:
// Allocate block
21:
if (blknu = biop->Allocate(nBlks,HOS.sub.-- FALSE))
22: { // got blocks in one chunk
23: // Get a mutex locked s.sub.-- buf
24: slot = GetBufferSlot(sbufMutexLock, HOS.sub.-- FALSE);
25: // hashtb not locked
26:
27: if (|slot)
28: {
29: // All buffers are in-use
30: S.sub.-- BUFMAN.sub.-- THROWBUF1(BUFMAN.sub.-- ALLSLOTSLOCKED,
slot,
31: .sub.-- maxbuffers);
32: }
33: // init, grab memory buffer
34: slot->Prepare(biop, blknu, btype, nBlks, ctype, 0);
35: if (.sub.-- lastBlock < blknu)
36: .sub.-- lastBlock = blknu;
// stats
37: } // got blocks in one chunk
38:
39:
else
40: // Chain blocks together as needed to satisfy request
41: { // try allocating in smaller pieces
42: stack.sub.-- diskdescriptor newDesc ;
43: AllocateDiskBlocks(&newDesc,biop,GetUserSize(nBlks));
44: blknu=newDesc.GetLinkAddr(0);
45:
46: // Get a mutex locked s.sub.-- buf
47: slot = GetBufferSlot(sbufMutexLock, HOS.sub.-- FALSE);
48: // hashtb not locked
49:
50: if (|slot)
51: {
52: // All buffers are in-use
53: S.sub.-- BUFMAN.sub.-- THROWBUF1(BUFMAN.sub.-- ALLSLOTSLOCKED,
slot,
54: .sub.-- maxbuffers);
55: }
56: slot->Prepare(biop, blknu, btype, nBlks, ctype, &newDesc);
57: s.sub.-- diskblockheader* sbuf=slot->.sub.-- dskBlk;
58: sbuf->SetFirstBlockId(newDesc.GetLinkAddr(0));
59: sbuf->SetFirstNBlocks(newDesc.GetLinkBlocks(0));
60: sbuf->SetChainedBlockId(newDesc.GetLinkAddr(1));
61: sbuf->SetChainedNBlocks(newDesc.GetLinkBlocks(1));
62: } // try allocating in smaller pieces
63:
64: // Allocated page and memory, ready to do work.
65:
66:
// Freshly created s.sub.-- bufs are always dirty
67:
slot->DirtyInternal( );
68:
69:
// we guarantee that the buffer comes back full of zeros
70:
hos.sub.-- memset(slot->GetData( ), 0, slot->GetUserSize( ));
71:
72:
// We're done diddling the slot
73:
slot->LockUserLocked( ) ; // Set the lock on the buffer
74:
sbufMutexLock.UnLock( ); // and release the mutex
75:
76:
// If we throw now the above slot is lost, because it is
77:
// userlocked and will not be unlocked. But if we throw while
78:
// inserting into the hash table we're in big trouble
79:
// anyway and will presumably no longer be useable.
80:
81:
// Always update hashtable after s.sub.-- buf::Prepare
82:
// Done outside s.sub.-- buf mutex lock, since hashtable locks
83:
// must be obtained first
84:
// Note that nobody can Find( ) this buf yet since we
85:
// haven't returned it.
86:
hos.sub.-- mutexCatchLock hashtableLock(&.sub.-- hashtableMutex,
.sub.-- throwing) ;
87:
88:
.sub.-- hashtable->InsertKeyVal(slot->GetPrimaryKey( ),
89: slot->GetSecondaryKey( ), slot) ;
90:
hashtableLock.UnLock( ) ;
91:
92:
.sub.-- ncreates++;
// stats
93:
.sub.-- ncBlks += nBlks;
// stats
94:
return slot; // return ptr to s.sub.-- buf to user
95:
} // Create
__________________________________________________________________________
After some initialization steps, the method validates the passed-in file pointer at line 11. At line 12, the method checks whether it has read/write access. At lines 14-18, the method picks a compression type, if the user has not already specified one. In a preferred embodiment, the compression type is actually assigned on a per-page basis. When no type is specified by the user, the system picks a default compression type (ctype) based on the page type (btype). Internally, the system maintains an array of page types. Stored together with each page type entry is a preferred compression type, thus allowing the system to easily select a preferred compression type for a given page type. At line 21, the method allocates n number of blocks, by calling an Allocate subroutine. The subroutine returns true when the number of blocks allocated in one chunk (i.e., contiguous) equals n. Otherwise (i.e., the Allocate subroutine returns false), the method executes the else statement of line 39, to chain together blocks as needed to satisfy the allocation request. For the instance where the if statement holds true (lines 22-37), the method sets the variable slot to the hash table buffer slot which will store the entry for the allocated block. If a slot is not properly set (tested at line 27), then all buffers are in use and the method throws an exception, at line 30. In a similar manner (and after chaining together blocks), the method at the else statement (beginning at line 39) sets the slot at line 47 and throws an exception at line 53 if no slot is obtained. Here, the page number which is hashed on is stored by blknu (which is returned by the subroutine call to the allocation subroutine). This is the logical page number for the blocks. After a valid slot has been obtained, the method "prepares" (i.e., initializes) the memory buffer for the slot, at line 34 (if allocated in one chunk) or at line 56 (if allocated in multiple chunks). Upon reaching line 64, the method has allocated a page on disk, obtained a hash table slot, and allocated system memory. Note that the slot is actually an sbuf (pointer) which is eventually returned to the user. At line 67, the method "dirties" the buffer, as newly-created buffers are always considered dirty (i.e., do not have a current version stored on disk). The call, in effect, sets the "dirty" bit for the page so that when the page is paged out of memory it is also written to disk. At line 70, the method initializes the buffer, by filling it with zeros. At line 73, the slot is "locked" (i.e., exclusive access). At this point, therefore, the method sets the lock on the buffer, as it is returned to the user locked. At line 74, the method releases the mutex (which was set during initialization). The slot is now ready for insertion into the hash table. Insertion into the hash table is done by obtaining a lock to the hash table, at line 86. Then the slot is inserted by calling a subroutine, InsertKeyVal. The lock is then released, at line 90. At lines 92-93, the method sets statistical flags (for monitoring purposes). Finally, at line 94, the method returns the slot (s.sub.-- buf pointer) to the user. (2) Find In an exemplary embodiment, the Find method may be constructed as follows (using the C++ programming language).
__________________________________________________________________________
1:
s.sub.-- buf *s.sub.-- bufman::Find(hos.sub.-- bio *biop, hos.sub.--
uint block,
2: s.sub.-- btype btype, hos.sub.-- int nBlks)
3:
{ // Find
4:
s.sub.-- buf *slot;
5:
hos.sub.-- boolean try.sub.-- again=HOS.sub.-- TRUE;
6:
7:
// Take block# and bio and build key
8:
hos.sub.-- int key=.sub.-- hashtable->GetHashKey(biop, block) ;
9:
// . . .
10:
11:
// error checking
12:
hos.sub.-- AutoMemcheck( );
13:
if (block < 1)
14: S.sub.-- BUFMAN.sub.-- THROW(BUFMAN.sub.-- BADBLOCK);
15:
validbio (biop);
16:
hos.sub.-- mutexCatchLock hashtableLock(&.sub.-- hashtableMutex,
.sub.-- throwing,
17: hos.sub.-- mutexCatchLock::DONT.sub.-- LOCK.sub.-- NOW) ;
18:
19:
while (try.sub.-- again)
20:
{
21:
try.sub.-- again = HOS.sub.-- FALSE;
22:
hashtableLock.Lock( ) ;
23:
slot = (s.sub.-- buf*) .sub.-- hashtable->FindKey(key, biop, block) ;
24:
25:
// Cache miss
26:
if (|slot)
27: { // pull it in from disk
28: // Get a mutex locked s.sub.-- buf
29: slot = GetBufferSlotLocked(sbufMutexLock,
30: hashtableLock, biop, block); // hashtb *IS* locked
31: if (|slot)
32: {
33: // All buffers are in-use
34: S.sub.-- BUFMAN.sub.-- THROWBUF1(BUFMAN.sub.-- ALLSLOTSLOCKED,
slot,
35: .sub.-- maxbuffers);
36: }
37: slot->Prepare(biop, block, btype, nBlks, HOS.sub.-- CMPRST.sub.--
NONE, 0) ;
38:
39: // Always update hashtable after s.sub.-- buf::Prepare
40: slot->LockUserLocked( ) ;
41:
42: if (|slot->Read( ))
43: S.sub.-- BUFMAN.sub.-- THROWBUF(BUFMAN.sub.-- BIO.sub.-- UNEXPECTEDE
OF, slot) ;
44:
45: // This slot is ready for use
46: sbufMutexLock.UnLock( );
47: } // pull it in from disk
48:
// Cache hit
49:
else
50: { // found slot, move it to head of the queue
51: // error checking
52: if (slot->GetBlockMap( ))
53: // should be using blockmap Find( ) interface
54: hos.sub.-- abort("Programming error") ;
55:
56: sbufMutexLock.SetMutex(&slot->.sub.-- mutex);
57: //
58: if
(|sbufMutexLock.TryLock( ))
59: {
60: // presumably, this is very unlikely to happen|||
61: hashtableLock.UnLock( ) ;
62: try.sub.-- again = HOS.sub.-- TRUE;
63: // re-loop
64:
65: }
66: else
67: {
68: hashtableLock.UnLock( ) ; // not done until slot is locked
69: slot->LockUserLocked( ) ; // NOTE: slot mutex is locked
70: sbufMutexLock.UnLock( ) ;
71: slot->Validate(block, biop, nBlks, btype);
72: .sub.-- bufpool.MoveSlotToHeadofQueue(slot) ;
73: }
74: } // found slot, move it to head of the queue
75:
}
76:
77:
.sub.-- nfinds++;
// stats AND algorithm semantics
78:
.sub.-- nfBlks += nBlks;
// stats
79:
return slot;
80:
} // Find
__________________________________________________________________________
As previously described, the method takes four parameters. The first parameter, *biop, is a binary input/output pointer--a pointer to a logical file or group of files (presented as one object which is being cached). The second parameter, block, is the block (i.e., page number) to find. The third parameter, btype, indicates what page the method is looking for. The last parameter, nBlks, represents how many physical blocks equal one page (i.e., how big is a page the method is looking for). After declaring local variables, at lines 4-5, the method takes the passed-in block number and bio pointer and gets the key from the hash table, by calling a GetHashKey subroutine, at line 8. This serves as a key or index into the hash table. After error checking (line 12-15), the method takes out a lock on the hash table, at line 16. At line 19, a while loop is established, for looking up the hash key. In particular, the method invokes a FindKey subroutine with the hash key, at line 23, for determining a corresponding slot (i.e., s.sub.-- buf pointer). One of two possibilities arise at this point: a cache "miss" or a cache "hit." The cache "miss" case is dealt with at line 26; the cache "hit" case is dealt with at line 49. Each will be described in further detail. The cache "miss" instance is determined when the slot variable equals 0--determined by the if statement at line 26. The if statement processes the cache miss as follows. At line 29, the method attempts to get a free slot. It will get a free slot unless all slots are in use (tested at line 31), whereupon it will throw an exception, at line 34. The method cannot get a slot when it is unable to page out other pages from the cache. Typically, the subroutine call at line 29, GetBufferSlotLocked, will secure a slot, performing any necessary page out operation. Once having secured a slot, the method can proceed to line 37 to "Prepare" (initialize) the slot. In particular at this point, the method initializes data structures in preparation for the read operation which is to occur. Next, the method may proceed to do the actually read. Specifically, a user lock is taken out on the slot, at line 40. This is followed by the actual read, at line 42. If, for some reason, the read operation fails, an exception is thrown, at line 43. Otherwise, the slot is ready for use and the mutex lock may be lifted (line 46). The else statement at line 49 covers the instance of a cache "hit." The condition is processed as follows. At line 52, the mapping for the slot is obtained, by calling the GetBlockMap subroutine. If an error occurs, the method aborts at line 54. Otherwise, the method proceeds to line 56, where a mutex lock is taken out on the slot. At this point, the method will wait for the slot, in the case that the slot is currently in use by another process. In particular, the mIn particular, the method tests at line 58 whether the mutex lock has been successfully obtained. If not (false at line 58), the method will release the lock on the hash table and (effectively) loop back to line 19, for re-executing the while loop. When the mutex lock is successfully obtained, the method enters the else statement, at line 66. After lock maintenance and slot validation (lines 68-71), the method updates the LRU chain, with a subroutine called to MoveSlotToHeadOfQueue, at line 72. In particular, the subroutine call moves this block to the end of the LRU chain (i.e., to the most-recently used end). Upon reaching line 77, the method has found a slot (performing any necessary disk I/O operation) and moved the slot to the head of the queue (i.e., LRU chain). At lines 77-78, statistics are collected. Finally, at line 79, the slot is returned to the caller. (3) Flush In an exemplary embodiment, the Flush method may be constructed as follows (using the C++ programming language).
__________________________________________________________________________
1:
void s.sub.-- bufman::Flush(hos.sub.-- boolean unlockedOnly)
2:
{
3:
// error check
4:
hos.sub.-- AutoMemCheck( );
5:
MustHaveRWAccess( );
6:
7:
if (|hos.sub.-- lwtask::IsMainThread( ))
8: // must be called from parent thread
9: S.sub.-- BUFMAN.sub.-- THROW(BUFMAN.sub.-- PROGERROR) ;
10:
11:
// Free up blockmaps (and unlock their s.sub.-- bufs)
12:
// that aren't in use
13:
hos.sub.-- mutexCatchLock blockmapsLock(&.sub.-- blockmapsMutex,
.sub.-- throwing) ;
14:
s.sub.-- hashtb.sub.-- iterator iterator(.sub.-- blockmaps) ;
15:
while (iterator( ))
16: { // delete blockmaps not in use
17: s.sub.-- blockmap* blockmap = (s.sub.-- blockmap*)
iterator.GetValue( ) ;
18: if (|blockmap->Inuse( ))
19: DeleteBlockMap(blockmap, &blockmapsLock) ;
20: } // delete blockmaps not in use
21:
blockmapsLock.UnLock( ) ;
22:
23:
// Flush the buffers
24:
.sub.-- bufpool.Flush(unlockedOnly) ;
25:
// Will call into s.sub.-- buf::write( )
26:
27:
.sub.-- nflushes++;
// stats
28:
29:
if (.sub.-- blockmapBufman && |unlockedOnly)
30: .sub.-- blockmapBufman->Flush(unlockedOnly) ;
31:
} // Flush
__________________________________________________________________________
The Flush method is invoked with a single parameter, unlockedOnly. This is a boolean, used for optimization, which indicates that only unlocked buffers should be flushed (as opposed to all buffers). At lines 3-5, the method performs error checking, including determining that read/write access is available (to this client). At line 7, the method determines whether it is on the main thread of execution, as the Flush operation is done from the main thread. At lines 11-21, the method frees up blockmaps which are not in use, including unlocking corresponding s.sub.-- buf members (i.e., slots). At line 24, the method does the actual Flush operation, by calling an internal .sub.-- bufpool. Flush subroutine. In essence, the subroutine loops through the s.sub.-- buf members, acquires necessary locks, performs error checking, and writes those which are "dirty." The actual write operations are performed by calling to s.sub.-- buf::write (for each s.sub.-- buf which is "dirty"), which in turn calls compression routines. Thus the call to the internal Flush will, if compression is activated, include an indirect call to the compression routines. Next, statistics are gathered, at line 27, and the blockmap is flushed, at line 30. Thereafter, the method is done and may return. (4) Destroy In an exemplary embodiment, a Destroy method of the present invention may be constructed as follows.
__________________________________________________________________________
1:
void s.sub.-- bufman::Destroy(s.sub.-- bufPtr& sb)
2:
{ // Destroy
3:
4: hos.sub.-- AutoMemCheck( );
5: MustHaveRWAccess( );
6: s.sub.-- buf* slot = sb;
7: sb.Clear( );
// so sb doesn't try to UnLock( )
8:
9: hos.sub.-- int logicalBlocks = slot->GetLogicalNBlocks( ) ;
10:
11: hos.sub.-- mutexCatchLock sbufMutexLock(&slot->.sub.-- mutex,.sub.--
throwing) ;
12:
13: // needs a check here to make sure .sub.-- userlocks == 1
14: slot->UnlockUserLocked( ) ;
15:
16: /*
17: Lock the hashtable to prevent another thread from picking
18: up the freed block(s) and trying to register them in the
19: hash table before we've deregistered it from the hash table.
20: This lock is unfortunate, since deallocate may be non-trivial
21: from a bio standpoint. The RIGHT thing to do is unregister
22: the block in the hashtable before doing the deallocate.
23: However, this is non-trivial in the current implementation.
24: WARNING: PROBABLE THREADED PERFORMANCE BOTTLENECK
25: */
26:
27: hos.sub.-- mutexCatchLock hashtableLock(&.sub.-- hashtableMutex,
.sub.-- throwing) ;
28:
29: // See if we can deallocate it the quick way
30: if (.sub.-- allowQuickDrop && |slot->GetDirty( ) && slot->Destroy(
))
31: { // did quick destroy
32: .sub.-- nqdestroys++;
// stats
33: .sub.-- nqdBlks += logicalBlocks ;
// stats
34: } // did quick destroy
35: else
36: slot->Deallocate( );
// pre-image the slot
37:
38: // This slot is to be eligible, but we're releasing
39: // it's diskblock memory
40: // The following 3 statements, plus the above hashtable lock,
41: // used to be s.sub.-- bufman::MoveSlotToEmptyList( )
42:
43: .sub.-- bufpool.MoveSlotToEmptyList(slot,
44: HOS.sub.-- FALSE,
// release diskblock memory
45: HOS.sub.-- TRUE,
// hashtable is locked
46: // callee shouldn't really need this
47: &.sub.-- hashtableMutex,
48: .sub.-- hashtable) ;
49: hashtableLock.UnLock( ) ;
50: .sub.-- nBuffersInuse--;
51:
52: sbufMutexLock.UnLock( );
53:
54: .sub.-- ndestroys++;
// stats
55: .sub.-- ndBlks += logicalBlocks;
// stats
56:
} // Destroy
__________________________________________________________________________
The method is invoked with a single parameter, a pointer to an s.sub.-- buf. After error checking, at lines 4-5, the method stores the s.sub.-- buf pointer into a local variable, slot. At line 7, the method clears the lock which the user has. At line 9, the method gets the number of logical blocks--that is, how big it is. At line 11, the method gets a mutex lock for the slot. At line 14, the method checks the user lock, making sure the user lock count equals 1. Since the buffer is being destroyed, the method unlocks the user lock at this point. At line 27, the method takes out a hash table mutex lock. At lines 30-36, the method undertakes to actually destroy the slot. If "quick drop" (i.e., without logging) is allowed and the slot is not dirty (i.e., does not need to be written to disk), a slot Destroy method is invoked for actually destroying the slot. If the Destroy operation is successful, statistics are set at lines 32-33. Otherwise (e.g., when "quick drop" is not allowed), the method calls a Deallocate subroutine, at line 36. This subroutine will pre-image the slot before destroying it, thus allowing error recovery. By the time the method reaches line 38, the slot has been destroyed on disk; its space is returned to the free list or page manager. In a corresponding manner, the in-memory buffer is marked as destroyed. Now, the slot is moved from the "in-use" list to an "empty" list, by a subroutine call at line 43. The hash table entry is freed during the call to "destroy" the slot (either at line 30 or line 36, depending on whether "QuickDrop" is allowed). The method concludes by releasing locks: the hash table lock is released at line 49 and the mutex lock is released at line 52. Statistics are collected (line 50 and lines 54-55). Thereafter, the method returns to the caller. 3. Detailed construction and operation of s.sub.-- buf In an exemplary embodiment, the s.sub.-- buf class definition may be constructed as follows (using the C++ programming language).
__________________________________________________________________________
1:
// s.sub.-- buf class
2:
3:
class s.sub.-- buf
4:
{
5:
friend class s.sub.-- bm;
6:
friend class s.sub.-- blockmap ;
7:
friend class s.sub.-- bufcache ;
8:
friend class s.sub.-- bufcacheitem ;
9:
friend class s.sub.-- bufman;
10:
friend class s.sub.-- bufman.sub.-- exception ;
11:
friend class s.sub.-- bufpool.sub.-- ts ;
12:
friend class s.sub.-- bufpool.sub.-- MRU.sub.-- iterator.sub.-- tg ;
13:
friend class s.sub.-- bufpool.sub.-- LRU.sub.-- iterator.sub.-- tg ;
14:
friend class hs.sub.-- bufADBM.sub.-- tg ;
15:
16:
public:
17:
18: void LockUser( ); // Public use only for recursive locking
19: void UnlockUser( ) ;
20: void Dirty( ); // Do this BEFORE modify the buffer
21:
22: /* Use GetBlock for a handle on the s.sub.-- buf on disk */
23: hos.sub.-- uint
GetBlock( )
const
{ return .sub.-- block ; }
24:
25: hos.sub.-- int
GetBlockSize( )
const ;
26:
27: hos.sub.-- bio
*GetBioP( )
const
28: {
29: return .sub.-- biop;
30: }
31:
32: s.sub.-- btype
GetBType( ) { return .sub.-- btype; }
33: hos.sub.-- int
GetChainedNBlocks( ) const
34: { return .sub.-- diskdesc.GetTotalChainedBlocks( ) ; }
35: void *GetData( ) const { return .sub.-- dskBlk->GetData( ); }
36: /* GetNBlks is an undesirable export,
37: use GetLogicalNBlocks instead */
38: hos.sub.-- int
GetNBlks( ) const
39: { return .sub.-- dskBlk->GetLogicalNBlocks( ) ; }
40: hos.sub.-- int
GetUserSize( ) const ;
41: /* These are preferred over GetNBlks */
42:
hos.sub.-- int GetFirstNBlocks( ) const {return .sub.-- dskBlk->GetFirs
tNBlocks( ); }
43: hos.sub.-- int
GetLogicalNBlocks( ) const { return .sub.-- nLogicalBlks ; }
44: hos.sub.-- int GetPhysicalNBlocks( ) const
45: { return .sub.-- diskdesc.GetTotalBlocks( ) ; }
46: void MarkBlocksInUse(hs.sub.-- xstats* stats) const ;
47:
48:
protected: // methods
49:
50: s.sub.-- buf(hos.sub.-- memloc in.sub.-- shrmem) ;
51: .about.s.sub.-- buf( );
52:
53:
private: // methods
54:
55: s.sub.-- buf( );
56: s.sub.-- buf&
operator=(const s.sub.-- buf& rhs);
57:
58: void AllocateBlockMappedBlocks(hos.sub.-- uint dataBytes,
59: hos.sub.-- boolean reregister) ;
60: hos.sub.-- byte*
AllocateDiskBlockMemory(hos.sub.-- int nLogicalBlocks,
61: hos.sub.-- int blockSize) ;
62: void BumpNUsed( ) { .sub.-- nused++; }
63:
64: hos.sub.-- byte*
Compress(hos.sub.-- uint& compressedBytes) ;
65: hos.sub.-- boolean
CompressionEligible( ) const ;
66: hos.sub.-- boolean
CompressionRequested( ) const
67: { return .sub.-- dskBlk->GetRequestedCompressionType( )
68: > HOS.sub.-- CMPRST.sub.-- NONE ; }
69: void Deallocate( ) ;
70: void Decompress(s.sub.-- diskblockheader* tmpHeader) ;
71:
72: void DeleteDiskBlockMemory(s.sub.-- diskblockheader*& sdbh) ;
73:
74: hos.sub.-- boolean Destroy( ) ; // do bio destroy instead of
deallocate
75: void DirtyInternal( ) ; // internal (unlocked) version of Dirty( )
76: void DropDiskBlocks(hos.sub.-- boolean save.sub.-- buffer);
77:
78: static hos.sub.-- uint debugBlockSum(s.sub.-- diskblockheader*,
hos.sub.-- int) ;
79: void debugBlockHeader(const char*) const ;
80:
81: void Dump(const hos.sub.-- text *vn = 0) const ;
82: void Dump(const hos.sub.-- text *vn, hos.sub.-- int idx) const ;
83: void FreeExcessDiskBlocks(s.sub.-- bufdiskdescriptor* desc,
84: hos.sub.-- uint bufBytes) ;
85: // Internal GetBlock which doesn't require true IsUserLocked( )
86: hos.sub.-- uint
GetBlockInternal( ) const { return .sub.-- block ; }
87: s.sub.-- blockmap*
GetBlockMap( )
const
{ return .sub.-- blockmap ; }
88: s.sub.-- bufman*
GetBufMan( )
const
{ return .sub.-- bufman; }
89: hos.sub.-- int
GetDirty( ) const { return .sub.-- flags & S.sub.-- BUFDIRTY
; }
90: hos.sub.-- int GetIsMap( ) const { return .sub.-- btype == S.sub.--
BTYPE.sub.-- BLOCKMAP ; }
91: s.sub.-- bufdiskdescriptor* GetDiskDescriptor( ) { return &.sub.--
diskdesc ; }
92:
93: // Generate blockmap sensitive hash key for the bufman
94: void* GetPrimaryKey( ) const ; // Bio or Blockmap ptr
95: hos.sub.-- uint
GetSecondaryKey( ) const
96: { return .sub.-- block ; } // logical or physical block id
97: /* These return physical block id's */
98: hos.sub.-- uint
GetFirstBlockId( ) const
99: hos.sub.-- uint
GetChainedBlockId( ) const ;
100:
101: /* This method is only for debugging */
102: hos.sub.-- uint
GetUnreliableFirstPhysicalBlockId( ) const
103: { return (.sub.-- dskBlk ? .sub.-- dskBlk->GetBlock( ) : 0) ; }
104:
105: hos.sub.-- int
GetFlags( ) { return .sub.-- flags; }
106: s.sub.-- buf*
GetNextYoungest( ) const
{ return .sub.-- next; }
107: s.sub.-- buf*
GetPrevOldest( ) const
{ return .sub.-- prev; }
108: hos.sub.-- boolean HasPhysicalBlocks ( ) const
109: { return .sub.-- dskBlk->GetFirstBlockId( ) |= 0 ; }
110:
111: hos.sub.-- boolean
IsBlockMapped( ) const
{ return .sub.-- blockmap |= 0 ; }
112: hos.sub.-- boolean
IsUserLocked( ) const
{ return .sub.-- userlocks > 0; }
113:
114: void LockUserLocked( ) ;
// assumes .sub.-- mutex is locked
115: void UnlockUserLocked( ) ;
// assumes .sub.-- mutex is locked
116:
117: void Lock( )
118: {
119: .sub.-- mutex.Lock( ) ;
120: }
121:
122: void UnLock( )
123: {
124: .sub.-- mutex.UnLock( ) ;
125: }
126:
127: hos.sub.-- boolean
TryLock( )
128: {
129: return .sub.-- mutex.TryLock( ) ;
130: }
131:
132: void MustBeActiveBufThrow(const hos.sub.-- text *variant) ;
133:
134: void MustBeActiveBuf( )
// look for valid data
135: {
136: if (.sub.-- dskBlk == 0 .parallel. GetBlockInternal( ) == 0)
137: MustBeActiveBufThrow("Active") ;
138: }
139:
140: void MustBeInactiveBuf( )
// most ivars should be zero
141: {
142: if (.sub.-- biop |= 0 .parallel. .sub.-- block |= 0)
143: MustBeActiveBufThrow("Inactive") ;
144: }
145:
146: void RawRead(s.sub.-- diskblockheader *outputBuffer,
147: hos.sub.-- uint blockId, hos.sub.-- int nBlocks) ;
148: hos.sub.-- boolean Read( );
149:
150: void ReallocateBlockMappedBlocks(hos.sub.-- uint dat | ||||||
