Encoded-data database for fast queries6804664Abstract A new and efficient way of structuring databases enabling efficient query processing, reduced database storage requirements, and more flexible content targeting is described. A database is defined having bit mapped data. The data to be stored in the database is characterized as a number of binary questions. Each record comprises one or more groups of bit maps corresponding to the answers to the questions. For some types of data, there may be two or more possible values. In such cases, each possible value is itself a binary question. In some cases, there may be data which does not lend itself to characterization as binary questions. In such cases, the database includes numeric and string values. Queries of the database may be obtained through simple bit wise Boolean operations directly of the records in the database. Claims It is claimed: Description NOTICE OF COPYRIGHTS AND TRADE DRESS
Question Answer Number Answer
Gender? A1 Female
Gender? A2 Male
Household Income? A3 $0-19,999
Household Income? A4 $20-49,999
Household Income? A5 $50-99,999
Household Income? A6 $100-
199,999
Household Income? A7 $200-300,000
Household Income? A8 $300,000+
The gender attribute therefore requires two bits, and the household income attribute requires 6 bits. These two binary attributes may be mapped to an eight-bit chunk in the following manner:
Answer Number Bit Offset
A1 1
A2 2
A3 3
A4 4
A5 5
A6 6
A7 7
A8 8
Given the mappings, consider the case of a female user with an income of $350,000. Answer 1 is selected so bit 1 is set to one. Answer 8 is selected so bit 8 is set to 1. All other bits are left as 0. The binary representation is 1000 0001, so the chunk has a decimal value of 129. A "range attribute" can have a range of numeric values. A record with a range attribute has a single number as its value. A target for this-attribute has two numbers: low range and high range. Examples of a range attribute include a user profile attribute of "zip code" or machine configuration profile attribute of "processor speed." Range attributes are useful for numeric attributes that have too many potential values to pragmatically store them as binary attributes. A "string attribute" can have a free form string value. A record with a string attribute has a string value that represents its value. A target for this attribute has a list of strings. Examples of a string attribute include a user profile attribute of "first name" or a content profile attribute of "city." String attributes are non-numeric attributes that have too many potential values to pragmatically store them as binary attributes. The correlation table 160 is used in conjunction with binary attributes. The correlation table 160 shows the connection between each answer and the bit that represents it. The correlation table 160 shows how to encode binary attributes, providing static mapping between allowable value and bit location. The user profile database 140 stores and provides information about users. In some cases, such as with an Internet service provider, the users may provide some user profile information directly. User profile information may be obtained from existing databases or through indirect means (e.g., analysis of user habits, purchasing histories). Users may be required to or may voluntarily update their user profiles on a periodic basis. The user profiles may include demographic information such as age, sex, marital status, home address and hobbies, as well as psychographic information such as likes and dislikes. Information about the client devices used by each user, such as type of device and processing capabilities, may also be obtained and stored in the user profile database 140 or another database. The content usage database 141 stores and provides information about content usage by users. That is, the content usage information is what the user is receiving and seeking through the network. In the case of an ISP, content usage takes the form of web browsing. Information about a user's web browsing may be captured by the client device 100, analyzed and stored in the content usage database 141. In the case of a television service, this content usage may take the form of TV shows, channels selected at particular times and video rentals. In the case of non-electronic publishing, the content may be newspapers, magazines and books purchased, borrowed or rented. The interaction database 142 stores and provides information about specific information provided by the users. The interaction database is for storing keywords typed into search engines, information typed into on-line forms, etc. The interaction data differs from the content usage data in that the interaction data does not specify particular content, but rather identifies subject matter considered to be important for some reason to each user. The geographic data database 143 stores and provides information about a user's location. A user's dial-up information, such as a local access telephone number or the phone number of the client device 100 may be provided to the database server 130 which in turn stores the geographic location of the client device 100 in the geographic data database 143, derived through cross-referencing the known phone number against an appropriate database. Geographic data may also be obtained from the client device 100 through a GPS or other locator system. When using some types of mobile devices, the device's current location may be obtainable. The content profile database 150 stores and provides content profiles. The scheduling database 151 stores and provides information about the times when an item of content is to be served. Scheduling may include frequency, maximum number of times to send to a user, minimum number of times to send to a user, first and last days to send, rotation, CPM, impressions to serve, impressions received, click yield and priority. The content/location database 152 stores and provides items of content (e.g., HTML files) and/or the information about how to serve items of content (e.g., URLs). The client profile database 170 stores and provides profiles of client devices 100. Client profiles may include such information a software versions, processor type, processor speed, memory size, modem type, etc. The client profile database 170 is related to the user profile database 140, in that the profiles of client devices 100 used by the users are stored in the client profile database 170. User profiles may refer to client profiles, and client profiles may refer to user profiles. This is because a given client may be used by multiple users, and a given user may use multiple client devices. The Methods of the Invention Referring now to FIG. 2, there is shown a flow chart of the method of storing data in a database in accordance with the present invention. After the method begins (step 205), the database server 130 receives "raw" data for a record (step 210). By "raw," it is meant that the data includes fields which require processing before they are used. The raw data may be received directly, such as through the client device 100, or indirectly. Alternatively, the database server 130 may receive complete records, or profiles therefore. For example, the client device 100 may pass profiles to the database server 130. For those fields of the raw data which correspond to binary attributes, the database server 130 looks up the fields in the correlation table 160 and retrieves the bitmask for the binary attributes corresponding to the fields (step 220). For range attributes and string attributes, some processing may also be useful or necessary prior to storing. For example, range values may be rounded, and string values may be concatenated and parsed for illegal characters. The retrieved bitmask, plus the values of the range attributes and string attributes, form one or more profiles of the record. Next, the database server 130 stores the profiles of the record in the appropriate databases 140-143, 150-152, 170 (step 230). This then completes the process of storing the record in the databases (step 295). Although the process has been described with respect to a single record, records may be processed and stored sequentially or in bulk. Referring now to FIG. 3, there is shown a flow chart of a method of querying a database in accordance with the present invention. The method is first described in general terms, and then details of the method are developed further with respect to FIGS. 4-7. The method of querying the databases is generally implemented as one or more computer programs of the database server 130. Some aspects of the method are or may be operated by other parts of the system of the invention, such as client device 100. As the method begins (step 305), a query is initiated (step 310). A query is initiated, for example, when the client device 100 requests content, such as a list of targeted advertisements to be displayed. Any of the databases may be queried, and a query may be made to more than one database. For convenience, the database or databases to be queried will be referred to as the "query database." In this step 310, a query profile is prepared. The query profile may include combinations of binary attributes, range attributes and string attributes. The query profile may be a copy of or based upon records of one or more of the databases 140-143, 150-152, 170. As described below, the method of querying a database is a reductive process. In a typical database query, the fields of each record of the database is compared in turn against the query. If the fields of a record match the query, then the record is added to a list of matching records. A typical database query is therefore an additive process. According to the invention, a query is structured to benefit from how the data is stored (as described above), and how a query may be best processed. The present invention may also be considered a successive filtering process. According to the invention, the query starts with a holding list of all records in the query database (step 320). Then, those records which do not match the query profile are eliminated from the holding list (steps 340, 350, 360). The holding list defines the query results. In the preferred embodiment, the holding list has an identifier for each record, rather than copies of the records. This reduces memory requirements and enhances processing efficiency. However, the holding list could have complete, partial or compacted copies of each record, depending on the embodiment. Furthermore, the holding list may be prepared at a point other than directly after the query is initiated. For convenience, the following discussion will not refer to the identifiers in the holding list, but instead will refer to the records as if they were in the holding list. This manner of abbreviation should not be taken in a limiting sense. In some case, such as "run of site" ads in an ad list, records may be added to the holding list after matching, or before matching but excluded from removal. There is another significant difference between the typical database query and a database query in accordance with the invention. Both consider records in turn. However, in the present invention, when considering a particular record, not all profiles of the record are compared against the query profile in the same cycle. Instead, binary attributes for all records are tested against the binary attributes of the query profile. Those records not matching are eliminated from the holding list. Next, range attributes for all records in the holding list are tested against the range attributes of the query profile. Those records not matching are eliminated from the holding list. Finally, string attributes for all records in the holding list are tested against the string attributes of the query profile. Those records not matching are eliminated from the holding list. It has been found that this method provides faster and faster results the larger the requested list is when compared to an additive process. In the binary attribute matching step (step 340), the binary attributes of the query profile are matched against the binary attributes in the query database. For example, if the query profile is based upon a user profile from the user profile database 140, then the query profile would be matched against the content profiles in the content profile database 150. Matching may be accomplished through binary operations, such as logical AND. In the range checking step (step 350), the range attributes of the target profile are matched against the range attributes in the query database. This may be accomplished through numeric and Boolean operations. In the string checking step (step 360), the string attributes of the target profile are matched against the string attributes in the query database. This may be accomplished through string operations. With each of the profiles checked, the database query is complete (step 395). Depending on the type of query, however, an additional step may be desirable to further narrow or order the results. This additional step may include the application of business rules for filtering and ordering the records in the holding list. Referring now to FIG. 4, there is shown a flow chart of one embodiment of the binary attribute matching step (step 320). After entry into this routine (405), the database server 130 establishes an order for processing chunks (step 420). Where there is only one chunk, this step 420 is moot. However, where there are multiple chunks, it is better to determine an order for processing the chunks prior to processing the chunks. With the processing of each chunk, the number of records in the holding list to be processed declines. Thus, binary attribute matching is more efficient if chunks having more non-null values are processed before chunks having less non-null values. The chunks should be processed accordingly. With the order of processing set, processing can begin. Assuming there are n chunks to be processed, chunks will be processed from 1 to n. Thus, the next step is to set the chunk number C to 1 (step 415). In this step, processing of the holding list is set to the beginning, for example by setting the record number R in the holding list to 1. Now, the comparison of chunk C of the query profile is made against chunk C of each of the records in the holding list. This is done by first getting the chunk C for record R in the holding list (step 420). Next, chunk C of record R is tested (step 430), and if chunk C of record R is null, then there is no reason to exclude record R from the holding list, so record R should remain in the holding list. On the other hand, if chunk C of record R is not null, it is compared against chunk C of the query profile (step 440). If the chunks match, or if chunk C of record R is null, then it is tested whether record R is the last in the holding list (step 450). This particular form of chunk matching presumes that chunks are logically joined by a logical AND. If record R is not the last in the holding list, then processing should continue with the next record in the holding list (e.g., by incrementing R) (step 420). A chunk of a record matches the chunk of a query profile when the chunk of the record has all of the bits set in the same position as those in the chunk of the query profile for the purposes of equality. For inequality targeting, then there would be a match only if the chunk of the record did not have the same bit set as the chunk of the query profile. This does not mean that the chunk of the record and the chunk of the query profile are equal numerically. Each bit that is not set (i.e., a logical 0) in the chunk of the query profile indicates the query profile has no preference for this target. For example, if a question is not targeted for a particular ad profile, then all values of the question's answers would be zero. For all intents, a zero in a query profile indicates that the value may be either 1 or 0 in the record profile and still count as a match. For example, suppose that each chunk has a length of eight bits, a record has a chunk of 102, and the chunk of the query profile is 6. Is there a match between the two values? The answer is yes. Based on the binary representations of each number, 0110 0110 for 102 and 0000 0110 for 6, we can clearly see that for each bit set to one in the chunk of the query profile (#2 & # 3 from the right), the exact same bit is also set to one in the record chunk. All other bits that are set in the record profile are irrelevant. However, instead of having to evaluate each bit separately, the entire chunk can be evaluated at one time by using the query profile's chunk as a bit mask. This is achieved by performing a logical AND (binary operator `&`) between the record chunk and the query profile chunk. Then compare the query profile chunk with the result. If the query profile chunk equals the result, then the two chunks are matching. This can be seen using the following logical representation of the example:
Binary Decimal
Record 0110 0110 102
Query Profile & 0000 0110 6
Result 0000 0110 6
The result of the AND operation and the query profile equivalent so the record matches the query profile. An example of a non-matching record and query are 100 and 6:
Binary Decimal
Record 0110 0100 100
Query Profile & 0000 0110 6
Result 0000 0100 4
Since the result is not equal to the query profile chunk, the record does not match the query profile. If the query profile specifies that the record cannot contain a specific value, then matching is identical except for the final evaluation, where if the query profile chunk equals the result, then the record is removed from the holding list, otherwise it is retained. Given the fast nature of the comparisons, the larger the chunk, the more questions/answers that can be evaluated at one time. However, given that a processor can only operate on a number of bits at a time, it does not provide any large amount of efficiency to use chunks larger that the processor can operate on at one time. Also, given the ability and need to establish multiple OR conditions for a specific question, breaking the profile into multiple chunks allows accommodation of the OR conditions without having to recreate the entire profile structure for each OR. Instead, only the chunk that contains the questions/answers the OR deals with need be recreated. In the current embodiment, that optimal number is set to 64, since the database server runs on a 64-bit processor running a 64-bit operating system. Returning to step 440, if the chunks do not match, then record R is removed from the holding list (step 445). Processing then continues with the next record in the holding list (step 420). Returning to step 450, if the end of the holding list has been reached, then the next chunk of the query profile is set for matching (e.g., C is incremented) (step 460). If there are no more chunks to process (step 470), then processing of binary attributes is complete (step 495). Otherwise, processing continues with the first record in the holding list (e.g., R is set to 1) (step 420). Referring now to FIG. 5, there is shown a flow chart of a range checking step (step 350). With entry into this procedure (step 505), cyclical processing in a manner similar to binary matching is performed. Assuming there are n ranges in the query profile to be processed, the ranges will be processed from 1 to n. Thus, the next step is to set the range number R to 1 (step 515). In this step, processing of the holding list is set to the beginning, for example by setting the record number R to 1. Now, the comparison of range R of the query profile is made against the value of range R of each of the records in the holding list. This is done by first getting the value of range R for record R in the holding list (step 520). Next, the value of range R of record R is compared against range R of the query profile (step 540). If the value of range R of record R is within range R of the query profile, then it is tested whether record R is the last in the holding list (step 550). If record R is not the last in the holding list, then processing should continue with the next record in the holding list (e.g., by incrementing M) (step 520). Returning to step 540, if the value of range R of record R is not within range R of the query profile, then record R is removed from the holding list (step 545). Processing then continues with the next record in the holding list (step 520). Returning to step 550, if the end of the holding list has been reached, then the next range of the query profile is set for testing (e.g., R is incremented) (step 560). If there are no more ranges to process (step 570), then processing of range attributes is complete (step 595). Otherwise, processing continues with the first record in the holding list (e.g., R is set to 1) (step 520). Referring now to FIG. 6, there is shown a flow chart of a string checking step (step 360). With entry into this procedure (step 605), cyclical processing in a manner similar to binary matching and range checking is performed. Assuming there are n strings in the query profile to be processed, the strings will be processed from 1 to n. Thus, the next step is to set the string number S to 1 (step 615). In this step, processing of the holding list is set to the beginning, for example by setting the record number R to 1. Now, the comparison of string S of the query profile is made against string S of each of the records in the holding list. This is done by first getting the string S for record R in the holding list (step 620). Next, string S of record R is compared against string S of the query profile (step 640). If the string S of record R matches string S of the query profile, then it is tested whether record R is the last in the holding list (step 650). If record R is not the last in the holding list, then processing should continue with the next record in the holding list (e.g., by incrementing M) (step 620). Returning to step 640, if string S of record R does not match string S of the query profile, then record R is removed from the holding list (step 645). Processing then continues with the next record in the holding list (step 620). Returning to step 650, if the end of the holding list has been reached, then the next string of the query profile is set for testing (e.g., S is incremented) (step 560). If there are no more strings to process (step 570), then processing of string attributes is complete (step 695). Otherwise, processing continues with the first record in the holding list (e.g., R is set to 1) (step 620). As can be seen, a database query in accordance with the invention can be very efficient. Further efficiency can be obtained by implementing key steps, such as the chunk matching step, in assembly language. It can be seen that such steps are assembly language friendly, so they are easily reduced to short assembly language code. Furthermore, the method is highly conducive to multi-threading and parallel processing. This efficiency comes at the cost of an extra step during data entry/storage of encoding data. Additionally, known compression and space-reduction techniques may be applied to the records and databases (and other structures). For example, values may be represented using dictionary-type methods, including methods that match bit patterns that are less than the entire length of a value. An effect of this compression and the compression techniques above is to produce more random bit patterns, which in turn improves hashing performance. In addition, the correlation table 160, databases 140, 141, 142, 143, 150, 151, 152 and other structures may be compressed, for example, using methods that take advantage of repeated bit patterns, such as run-length encoding, and word compaction (i.e., packing values into physical data storage units when there is a mismatch between the value size and the physical storage unit). A database in accordance with the invention therefore provides a number of benefits. First, it provides for reduced cost. An ad server that has low overhead and fast execution greatly reduces the amount of hardware and time to serve ads. Second, it provides content flexibility. There is considerable benefit to having a system which can target content, instead of just advertising. Content may include advertising for a regular banner ad, URL/keyword selection, customization of the user interface for a client application, customization of a start page for Internet access, or web page content customization. All of these items may be targeted in the same manner or in different manners. Third, it provides enhanced targeting through expanded and more flexible attribute selection. Fourth, it provides maximum efficiency. Given the cost of typical ad serving solutions in time, hardware and maintenance, the system provides the maximum effect with the least cost. This includes minimizing the use of RAM and other relatively expensive components (as opposed to disk space.) This also includes minimizing the impact on other server and client systems to reduce the requirements to support those systems. Although exemplary embodiments of the present invention have been shown and described, it will be apparent to those having ordinary skill in the art that a number of changes, modifications, or alterations to the invention as described herein may be made, none of which depart from the spirit of the present invention. All such changes, modifications and alterartions should therefore be seen as within the scope of the present invention.
|
Same subclass Same class Consider this |
||||||||||
