Internet profiling
6839680

Abstract
A system, method, and various software products provide for consistent identification of web users across multiple web sites, servers and domains, monitoring and capture of data describing the users' web activities, categorization of the web activity data, aggregation of the data into time dependent models describing interest of users and groups over time. Categorization is made with respect to a category tree which may be standardized or customized for each web site. User groups may be defined based on membership rules for category interest information and demographics. Individual user profiles are then created for users automatically based on satisfaction of the user group membership rules. As new data is collected on a user over time, the category interest information extracted from the user's web activity is updated to form a current model of the user's interests relative to the various categories. This information is also used to automatically update group membership and user profile information. Identification of users across multiple sites is provided by a global service that recognizes each user and provides a globally unique identifier to a requesting web server, which can use the identifier to accumulate activity data for the user. Client side user identification is provided to track user activity data on web servers that do not communicate with the global service and do not process activity for category information. User profiles may be shared among web sites that form alliances. User activity data may be aggregated along various dimensions including users/user groups, categorization, and time to provide robust models of interest at any desired time scale.

Claims
We claim:

Description

BACKGROUND
Web Event Record

Field            Explanation
User ID          Uniquely identifies the visitor.
Location         The URL or URI of the web content.
Start time       Onset of activity in Greenwich Mean Time for a single event. If there are multiple events at this URL, then the time of the earliest download.
End time         Last recorded activity in Greenwich Mean Time for a single event. If there are multiple events at this URL, then the time of the last download. If unknown, a default of 1 minute from the start time is used.
Event type       Stores a value indicating the type of web activity, such as view, clickthrough, purchase, and so forth.
Event count      The number of times this URL/URI was downloaded.
Category Score   The category scores for the content.
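The record layout above can be sketched as a small data structure. This is a minimal illustration only: the class name, the field names, the sample GID string and the example URL are assumptions chosen to mirror the table, not identifiers taken from the described system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Default applied when the end time is unknown (1 minute, per the table).
DEFAULT_DURATION = timedelta(minutes=1)


@dataclass
class WebEventRecord:
    """Hypothetical in-memory form of one web event record."""
    user_id: str                         # uniquely identifies the visitor
    location: str                        # URL or URI of the web content
    start_time: datetime                 # onset of activity (GMT)
    end_time: Optional[datetime] = None  # last recorded activity (GMT)
    event_type: str = "view"             # e.g. view, clickthrough, purchase
    event_count: int = 1                 # times this URL/URI was downloaded
    category_scores: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Fall back to start time + 1 minute when no end time was recorded.
        if self.end_time is None:
            self.end_time = self.start_time + DEFAULT_DURATION


record = WebEventRecord(
    user_id="gid-0001",                  # illustrative identifier
    location="http://example.com/a",     # illustrative URL
    start_time=datetime(1999, 1, 1, 10, 5, tzinfo=timezone.utc),
)
```

Keeping the category scores as a mapping from category name to score leaves room for either a full score vector or only the top scoring categories.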
For example, assume that a user's web activity is as follows:
Activity  Start Time-End Time        URL      Duration
1         10:05 am-10:10 am          <URL A>  5 min
2         10:10 am-10:12 am          <URL B>  2 min
3         10:12 am-10:14 am          idle
4         10:14 am-10:15 am          <URL C>  1 min
5         10:15 am-10:15:03 am       <URL B>  3 sec
6         10:15:03 am-10:16 am       <URL A>  57 sec
7         10:16 am-10:16:06 am       <URL D>  6 sec
8         10:16:06 am-10:16:10 am    <URL A>  4 sec
9         10:16:10 am-10:22:30 am    <URL E>  6 min 20 sec
10        10:22:30 am-10:30 am       idle
The web event records may be generated by either the web client 108 or the web server 102. If generated on the web client 108, the corresponding web event records would be as follows (note that the user ID and category score information is not shown here).
URL      Start-time   End-time     Duration                  Occurrence
<URL A>  10:05 am     10:16:10 am  5 min 57 sec              2
<URL B>  10:10 am     10:12 am     4 min *(see Note 2)       1
<URL C>  10:14 am     10:15 am     1 min                     1
<URL E>  10:16:10 am  10:22:30 am  5 min *(see Note 3)       1
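As a rough sketch of how a client could arrive at the Duration and Occurrence columns above: each capture's duration is approximated by the start time of the next capture, very short views are dropped, a single access is capped at the 5 minute maximum, and the remaining accesses are summed per URL. The 10 second filter threshold, the function and variable names, and the use of 10:30 as the end of the session are assumptions for illustration; the text does not state a threshold.

```python
from datetime import datetime, timedelta

MAX_DURATION = timedelta(minutes=5)    # pre-set cap for a single URL access
MIN_DURATION = timedelta(seconds=10)   # hypothetical "too short" threshold


def parse(clock: str) -> datetime:
    return datetime.strptime(clock, "%H:%M:%S")


# (capture time, URL) pairs as a client would see them; idle gaps are invisible.
CAPTURES = [
    ("10:05:00", "A"), ("10:10:00", "B"), ("10:14:00", "C"), ("10:15:00", "B"),
    ("10:15:03", "A"), ("10:16:00", "D"), ("10:16:06", "A"), ("10:16:10", "E"),
]
SESSION_END = "10:30:00"               # assumed end of tracking


def build_records(captures, session_end):
    """Aggregate captures into per-URL duration and occurrence counts."""
    starts = [parse(t) for t, _ in captures] + [parse(session_end)]
    records = {}
    for i, (_, url) in enumerate(captures):
        raw = starts[i + 1] - starts[i]    # next start approximates this end
        if raw < MIN_DURATION:
            continue                       # filtered out as too short
        rec = records.setdefault(
            url, {"start": starts[i], "duration": timedelta(0), "count": 0}
        )
        rec["duration"] += min(raw, MAX_DURATION)  # cap applies per access
        rec["count"] += 1
    return records


records = build_records(CAPTURES, SESSION_END)
```

Under these assumptions the sketch yields 5 min 57 sec over 2 occurrences for <URL A>, 4 min for <URL B>, 1 min for <URL C> and a capped 5 min for <URL E>, with <URL D> dropped entirely. The End-time column is not modeled here.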
Note 1. When a URL is captured, the current time is stored in the Start-time timestamp field of the web event record. The difference between the current time and the time in the timestamp of the previous record is calculated and stored in the previous record's "duration" field.

Note 2. Duration may or may not equal (End-time - Start-time). This is because there may be other events between the earliest download at this URL and the last download. For example, there is a gap of 2 minutes between visits to <URL B> and <URL C>. The "duration" in the activity table shows the actual time a user spends browsing a particular URL, while the "duration" in the web event record is an approximation of that time. Where the web event record is created by the web client 108, the client software may only approximate the real "duration" by taking the Start-time of the next URL as the End-time of the current URL. There is no way for the software to know about idle gaps in between URL visits without user intervention. Where the web event record is generated by the web server 102 that is tracking the user, the duration can be estimated.

Note 3. Here too, the duration for <URL E> can only be calculated by the web client 108 as 13 min 50 sec (10:30 - 10:16:10 = 00:13:50). The web client 108 will not know of the idle time after the access to <URL E>. However, the web client 108 (or the web server 102) may keep a pre-set maximum time for the duration of a single URL access, for example, 5 minutes. This normalizes the "duration" factor so that no single URL access can have an abnormally large "duration". A user may be tied up with other activities for a while between two URL accesses, and this may result in some abnormally large duration numbers. Those abnormally large duration numbers would incorrectly affect a user's Web usage pattern and profile. Note that the cumulative duration, however, is not limited to that maximum duration number.
For example, the duration for <URL A> is an aggregation of two separate URL accesses; therefore, it is not confined to the 5 minute limitation.

Note 4. Activities 5, 7, and 8 were not included in the total duration of any web event since they were filtered out for being too short a period of time. This is done to help reduce the data collection requirements and because such short duration views are not likely to be indicative of the user's actual interests.

In the next sections we describe the architecture and functionality of a system which records web events and provides the various capabilities to aggregate data as described.

II. Overview of ProReach System Architecture

The present invention may be embodied in a system which we call "ProReach". We begin with a very high-level overview of the ProReach architecture, describe the high-level components involved in this architecture, and show the high-level relationships between these components. We will also describe some typical configurations of ProReach, and show how ProReach supports one or more web servers, both behind and across firewalls. A discussion of the basic elements of alliances is included.

Referring to FIG. 3, there are shown various ProReach systems 100 operating over the Internet. Each ProReach system 100 handles one or more web servers 102. These web servers 102 can all belong to the same domain, or they can belong to different domains. FIG. 1 depicts two ProReach systems 100. One ProReach system 100 supports a single web server 102, while the other ProReach system 100 supports two web servers 102. In all, there are three ProReach-enabled web servers 102 in this figure.

Each ProReach-enabled web server 102 of a ProReach system 100 tracks the web visits of individual web visitors at the web site that the web server 102 serves. The web server 102 tracks and identifies the web visitor, obtains category information for the viewed content, and logs the visit, including its time or duration.
Once this data is gathered, the ProReach system 100 analyzes the data in order to evaluate the web visitor, and create or update a profile of the web visitor. The resulting profile of the user (or other profiles that are affected by the user's visits) can be used for marketing purposes, for page composition or for driving banner ads.

The various ProReach systems make use of ProReach Global Services 110. These global services 110 perform various tasks that are best centralized for purposes of efficiency and integrity of information. These global services 110, which are further discussed below, include identification of web visitors, maintenance and distribution of standardized categories to the various systems 100, and mechanisms for exchanging information between systems 100.

FIG. 1 further depicts two web clients 106, 108.
A web client is a conventional computer that includes a web browser, such as Netscape Communicator® or Microsoft Internet Explorer®. ProReach integrates with existing web browsers, and a special browser is not necessary to obtain the features or benefits of the invention. As an optional enhancement, however, certain web clients 108 may be ProReach-enabled. That means that these clients 108 execute client-side tracking software. On a periodic basis, ProReach-enabled clients 108 automatically use ProReach Global Services 110 to upload the data of their web activities, particularly to track web events of the users of the web client on web sites that do not have a ProReach system 100. This feature allows a more complete view of a user's interest, since it allows for integration of information about all web activity of the user, not just that activity at the ProReach systems 100 and servers 102. ProReach Global Services 110 is then responsible for sending this client data to various ProReach systems 100.

Referring now to FIG. 4, to support multiple web servers 102, each ProReach system 100 is configured in a hub and spoke topology that includes a hub 204 and one or more spokes 202. Each hub and spoke is a collection of executable software modules. Overall, a ProReach system 100 executes on enterprise server-class hardware, such as a Fujitsu teamserver M800i series server, which is a large scale web-hosting server with 4 Pentium® II Xeon™ processors and 8 GB of memory. The software environment preferably includes Microsoft Windows NT 4.0 as the operating system, Microsoft® Internet Information Server® 4.0 (IIS) for web site management, Microsoft Proxy Server 2.0 for firewall management, and Microsoft Site Server 3.0 for content management and delivery based on user and group profiles.

More particularly, each spoke 202 is dedicated to collecting and categorizing the visitor data from a web server 102.
Once the data is collected from the web server 102, it is partially processed on the spoke 202. The partially processed data is then moved from the spoke 202 to the hub 204. At the hub 204, the data is aggregated and further analyzed to produce up-to-date visitor profiles. Note that data from the same web visitor might stream in from different spokes 202, in which case the hub 204 aggregates this data into the appropriate user profile.

ProReach is architected so that most ProReach services are within company firewalls. Web servers 102 themselves are outside the firewall. A typical ProReach configuration including a ProReach system 100 for a single web server is depicted in FIG. 5. Here, the ProReach-enabled web server 102 is outside the firewall 206. A ProReach spoke 202 is connected to the web server 102, with communication taking place using server-side plug-ins, such as Java servlets. The ProReach spoke 202 itself is connected to a ProReach hub 204, as previously described. In FIG. 5, only one spoke 202 is shown, but as described, multiple spokes 202 may be used, each supporting its own web server 102.

ProReach-enabled clients 108, having tracked user visits at non-ProReach web servers 113, send their accumulated usage data to the ProReach Global Services 112. In turn, ProReach Global Services 112 routes the usage data to the appropriate ProReach systems 100.

FIG. 5 also illustrates how a ProReach system 100 can partner with other ProReach systems 100. Note how the hub 204 of one ProReach system 100 communicates with other ProReach systems 100. Such communication can involve sharing of data between the systems 100.

ProReach also works across web firewalls 206. For example, suppose a company had two web servers 102, each with its own domain name and firewall 206. It might be desirable to track all the web visitors at these web sites.
In this case, a different configuration of ProReach is used, in which one of the spokes 202 is attached to a local hub 204, and the other spoke 202 is remote and behind another firewall 206. The ability for ProReach to work across firewalls is desirable, particularly when web sites belonging to different organizations or companies are to be grouped together as a logical unit, with the data of their web visitors shared.

A. Global Services

In one embodiment, ProReach provides a number of global services 112. These services are provided by a master host system and server, such as may be provided by an overall provider of ProReach systems 100. The global services are shown in FIG. 6.

Global Identifier Service 502. This global service allocates global identifiers (GIDs) and provides other functionality related to visitor identification. A GID is used to globally identify a web visitor, so that the visitor's web events and other usage data can be properly collated when received from many different ProReach systems 100 or ProReach-enabled web clients 108.

Global Category Tree Service 504. This global service maintains and distributes a standard collection of categories. This allows the different ProReach systems 100 to use a common set of categories for describing and categorizing web content. In this manner, interest information from many different web sites can be measured and evaluated against a common framework of categories.

Global Upload Service 506. This global service works with the client tracking software to receive uploaded web activity data from the various ProReach-enabled web clients 108. This global service then distributes this web activity data to the appropriate ProReach systems 100.

Global Client Management Service 508.
This global service helps manage ProReach-enabled web clients 108, by keeping a list of all such clients, and by maintaining this list (e.g., adding new ProReach-enabled web clients 108 and deleting those no longer in operation).

Global Yellow Pages 510. This global service maintains an LDAP directory of ProReach systems 100.

Global Exchange Policy Service 512. This global service allows each individual ProReach system 100 to describe the business rules under which it will exchange web visitor information with other ProReach-enabled systems 100.

III. Basic System Processing

ProReach's job is to capture user data, subject it to analysis and produce a visitor profile summary for any individual visitor or for groups of visitors collectively. The visitor profile summary describes the interests of that given web visitor or group. There are many different processes involved in producing this web profile summary. These generally are as follows:

tracking web visitor activity on the web server;
tracking web visitor activity from the web client;
categorizing the documents that the web visitor views and determining their weights;
aggregating web events by time, by user and by category;
identifying the same web visitor when he visits different web sites;
aggregating the data, at different web sites, for the same web visitor, so that a global profile of the web visitor results;
category discovery and maintenance.

In the first of the next two sections, we summarize some of ProReach's key application processes. Following that section, we will look at category discovery and optimization.

A. ProReach Functional Overview

In this section, we describe the basic processing steps that take place, in order to show how data flows through a basic ProReach system 100. We will also view in more detail the structural features of a ProReach system 100.
Because we want to concentrate on these basic processing steps, we will make some simplifications and only explore a specific scenario. We will explore a scenario where the ProReach-enabled web server 102 only tracks web visits based on cookies resident on web clients 106. So while ProReach also tracks web visitors based on their login name and other information, this tracking is not shown below. We also assume here that the web client 106 allows cookies, which is true for most web clients.

In general, the overall process of tracking web activity is as follows:

A web client 106 visits a ProReach-enabled site 100.

The ProReach-enabled web server 102 redirects the web client 106 to a global service web server 112. This web server 112 is responsible for allocating global identifiers (GIDs) that identify web visitors. Web visitors are identified as specifically as is possible. Sometimes the identification pinpoints the actual person; sometimes it can only identify the web client 106 being used.

The global service web server 112 redirects the web client 106 back to the original ProReach-enabled web site 100 with extra data that identifies the web visitor.

The ProReach-enabled web server 102 takes this identifier and logs the web hit on a log. The entry on the log contains this identifier.

The web server 102 reads from this log of web hits and sends the data to a ProReach spoke 202.

Processing of each entry on this log begins on the spoke 202. The category of the web pages viewed by the visitor is computed. At this point the ProReach system 100 has determined who has accessed the web page and what the content of the web page is about.

Over time, a visitor's repeat visits to a web site 100 will result in a history of web events associated with that web visitor. ProReach manages this data by subjecting the data to an aggregation process. This process keeps the data as compact as possible while retaining useful analytical properties.
In particular, the aggregation process summarizes web events into more generalized descriptions of web activity, including summaries across users and/or categories.

After the aggregation step is completed, a profiling step takes place. This profiling step identifies the interests of a web visitor. The result is a web visitor profile summary of his or her interests.

The above steps demonstrate basic processing steps used to track, categorize and aggregate web visitor data. The result of these steps is a database of web visitor profiles that can be explored by web marketers, as well as being used for other purposes (selecting banner ads, personalizing content or services). Alternatively, a web marketer can then explore the population of his web visitors by using query tools. These steps will now be explored in detail in the remainder of this section.

Referring to FIGS. 7a-7c, there is shown the web server 102 portion of a ProReach system 100. The web server 102 includes a profile servlet 730, a category servlet 731, a logger 702, and a visitor log 704. We begin our processing with a visit from a web client 106. The web client 106 accesses 701 a web page hosted by the web server 102. The Logger 702 requests a GID for the web client. To get this GID, the Logger 702 makes a request to the global identifier service 602 of the global ProReach service 112. This request is initiated by redirecting 703 the web client to a ProReach web server that is part of ProReach global services 612, via the HTTP protocol.

In FIG. 7c, this web server can check whether the request from the web client 106 includes a ProReach cookie. If the ProReach cookie shows up in the request, the GID is extracted from the cookie. This is the GID that identifies this web client 106. If the request does not include the ProReach cookie, and hence the web client does not have a GID, then a new GID is generated by the global identifier service 612. This GID is guaranteed to be globally unique.
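The cookie check and the redirect back to the originating site can be sketched as follows. This is only an illustration under stated assumptions: the cookie name, the query parameter names, the example URLs and the use of a random UUID as the identifier are all hypothetical, and a random UUID is unique only with overwhelming probability rather than by guarantee.

```python
import uuid
from typing import Dict, Tuple
from urllib.parse import parse_qs, urlencode, urlparse

COOKIE_NAME = "proreach_gid"   # hypothetical cookie name


def resolve_gid(cookies: Dict[str, str]) -> Tuple[str, bool]:
    """Return (gid, is_new): reuse the cookie's GID or allocate a new one."""
    gid = cookies.get(COOKIE_NAME)
    if gid:
        return gid, False
    # Assumption: a random UUID stands in for the service's own allocator.
    return uuid.uuid4().hex, True


def build_return_url(site_url: str, original_page: str, gid: str) -> str:
    """Encode the GID and the originally requested page into the redirect URL."""
    return site_url + "?" + urlencode({"gid": gid, "return_to": original_page})


# First visit: no cookie, so a GID is allocated.
new_gid, is_new = resolve_gid({})
# Later visit: the cookie carries the same GID back.
same_gid, is_new_again = resolve_gid({COOKIE_NAME: new_gid})
# Redirect back to the site with the GID and the requested page in the URL.
redirect = build_return_url("http://shop.example/landing", "/products?id=7", new_gid)
query = parse_qs(urlparse(redirect).query)
```

Carrying the originally requested page in the same URL lets the site send the visitor on to that page once the GID has been extracted.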
The GID that the global service has computed is now returned 707 to the ProReach-enabled web server 102 via web redirection. The actual GID is encoded in the URL, so that the ProReach-enabled web client 106 can receive 705 this URL and extract the GID from it, storing the GID in a cookie. Other information is also encoded in the URL so that the web client 106 will be sent back to the page he originally requested.

If a web visitor has configured their browser not to accept cookies, the global identifier service 602 can detect this, and will still allocate a GID for this web visitor, which is returned via the redirect in the usual way. However, the value of this GID tells the ProReach-enabled web server 102 not to try to issue a session cookie and to log the events of this web visitor as an unknown or anonymous user.

In FIG. 7d, once this GID is returned to the web server 102, the Logger 702 can uniquely identify the web client, and thus the Logger logs 709 a web event record to the ProReach Visitor Log 704. This entry contains information on when the web access occurred, the GID, the URL of the web page that was accessed, and some other information as well. This sequence of operations is repeated for each web page or other web activity that the visitor generates.

As shown in FIG. 7d, the contents of the log 704 are periodically transferred from the web server 102 to a ProReach spoke 202, which is inside the firewall. The spoke 202 includes various other processing modules, including a log pre-processor 706, a hub visitor log 708, an event queue 710, an event processor 712, a categorizer 714, a page metadata cache 716, and a content recognition engine 718. Once the data reaches the spoke 202, it is pre-processed 706 for inclusion in the Visitor Log 708. The preprocessing turns the data, no matter its specific format, into web events of the standard form (e.g., an object representation of that data).
The Event Queue 710 monitors this log 708, and when new web event data is available, it fetches the data and also sorts the web entries by GID. The Event Queue 710 then calls on the Event Processor 712 to process each web event in the log 708. The Event Processor 712 ensures that the web event is categorized by making a request to the categorizer 714.

It is possible that the web page has already been categorized, and that this categorization information has been entered into the Page Metadata 716. Prior categorization occurs since ProReach spiders web sites in order to categorize their web pages as early as possible, so as to avoid doing categorization at runtime. However, since some web sites produce web content dynamically, ProReach cannot pre-categorize all web pages, and must be prepared to categorize web pages on a just-in-time basis. If the URL visited by the web visitor has already been categorized, then this data can be fetched from the Page Metadata cache 716. If not, the categorizer 714 makes calls on a content recognition engine 718.

The content recognition engine 718 manages a database of categories. Each category represents some kind of topic, such as "sports" or "news." A web page can be matched against any number of categories. The matched categories describe what a web page is about, and provide a means by which the visitor's interests can be identified. The content recognition engine 718 provides a score for a number of categories, each score measuring the degree to which the page may be said to be about the category. Preferably, a score is provided by the content recognition engine 718 for each category in the category database; alternatively, a score is provided only for a selected number of top scoring categories (e.g., the 10 highest scoring categories).
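The cache-first lookup with a just-in-time fallback can be sketched as below. The class, the toy scoring function and the score values are illustrative assumptions; the content recognition engine itself is represented only by a callable that returns category scores for a URL.

```python
from typing import Callable, Dict

Scores = Dict[str, float]


class Categorizer:
    """Hypothetical categorizer: consult the page metadata cache, else score."""

    def __init__(self, engine: Callable[[str], Scores], top_n: int = 10):
        self.engine = engine                 # stand-in for the recognition engine
        self.top_n = top_n                   # keep only the top scoring categories
        self.page_metadata: Dict[str, Scores] = {}
        self.engine_calls = 0                # counts just-in-time categorizations

    def categorize(self, url: str) -> Scores:
        cached = self.page_metadata.get(url)
        if cached is not None:
            return cached                    # page was categorized earlier
        self.engine_calls += 1
        scores = self.engine(url)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        top = dict(ranked[: self.top_n])
        self.page_metadata[url] = top        # remember for later events
        return top


def toy_engine(url: str) -> Scores:
    # Made-up scores purely for demonstration.
    return {"sports": 0.9, "news": 0.4, "finance": 0.1}


categorizer = Categorizer(toy_engine, top_n=2)
first = categorizer.categorize("http://example.com/scores")
second = categorizer.categorize("http://example.com/scores")
```

The second call is served from the cache, so the engine is consulted only once per URL in this sketch. A real deployment would also need a policy for invalidating entries when dynamic content changes, which the text does not specify.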
When the content recognition engine 718 completes its categorization process of a given web event, it updates the Page Metadata cache 716 for the web event to include a list of the scored categories and their respective scores. Once the cache is updated, the categories of the web event and their respective scores are returned to the Event Processor 712. The Event Processor 712 modifies the web event record to include the results of the categorization for that web event. Alternatively, the categorization information may be stored separately from the web event, and accessed from the web event by some other means, such as a URL.

Once the web event record has been categorized, the web event is ready to be sent off to the next stage of processing. That next stage of processing is on the ProReach Hub 204. More generally, the categorized web events are streamed from the ProReach spoke 202 or spokes to the hub 204.

FIG. 7e shows the features of a ProReach hub 204. The hub 204 includes an aggregator queue 722, an aggregation system 724, a profiler 726, a database agent 728, and a profile database 720. The hub 204 maintains a database 720 of web profiles. Each profile in this database 720 is uniquely identified by a GID. In each web profile, the web events of the web visitor are maintained by category. An exemplary web profile will describe an individual's (or group's) interest in each of a number of categories included in the category database.

The ProReach hub 204 takes newly categorized web events and integrates this data with the data of an existing web profile; this updates the profile of the visitor with the most current information about their interests, as captured in the web events generated from their web activity. If a web profile does not exist for the web visitor, then one is created. The first step of this aggregation process is to fetch the needed web profile from the database 720, using the web visitor's GID to select the web profile.
When a web event record or a set of event records is aggregated, the records are processed in groups where each web event has the same GID. Once the web profile for a GID is retrieved, the Aggregator System 724 performs an aggregation operation for all categories of documents that this web visitor has accessed. In one preferred embodiment, a threshold value for updating category weights is established, and only those categories for which the document scored higher than the threshold are updated.

Generally, the aggregator 724 updates the various user, group, and category summaries as described with respect to FIG. 2. Each of these summaries is held in its own web event record, which identifies the user, user group or category to which it applies, and the appropriate other aggregated weight values. Because of this approach, ProReach can retain large amounts of visitor data at lower cost, and this data is of higher quality, because it is designed to support the kind of operations needed by web marketers, that is, analysis of user interests and trends.

When the aggregation process is completed, the next step is to update the visitor's profile. Profiling 726 is a task that identifies the interests of a web visitor. To understand how this works, we first explore a brief example. Suppose there is a web marketer who wants to identify "sports enthusiasts" visiting the web site. The web marketer first defines what he means by "Sports Enthusiasts". There are many ways that this term could be defined:

Absolute Interest Magnitude Definition: A sports enthusiast is someone who looks at sports-related web pages at least twenty times every year.

Relative Interest Frequency Definition: A sports enthusiast is someone who looks at sports-related web pages more frequently than he looks at other web pages. For example, a sports enthusiast is someone who, if they look at 100 web pages, tends to look at at least ten sports-related web pages.
Comparative Interest Frequency Definition: A sports enthusiast is someone who looks at sports-related documents much more often than other web visitors.

Each of these three candidate definitions for the term Sports Enthusiast describes the interest as a function of the weight or weights of a "sports" category or categories, as determined from the web activity of the user. Any of these types of definitions (or others) may be used to define an interest with respect to any set of categories. Logically, an interest may be understood as a query, such as one uses in SQL, against the profile database 720 that determines if a web visitor does or does not have that interest. The query can be defined to evaluate the weights of any combination of categories.

With ProReach, a web marketer can name and define such interests using a simple query tool, such as a query by example tool, that operates on the database 720 via database agent 728. Once an interest is defined, the new interest is added into a given ProReach system 100 and activated. Once an interest is activated, it is the responsibility of the profiler 726 to take each interest and test whether a given web visitor has that interest or does not.

When profiling takes place, each activated interest is applied to the web visitor's data to determine if the visitor has that interest. The result is a profile which identifies which interests are applicable to the visitor. For example, imagine that there were five active interests in the database 720, such as Sports Enthusiast, Conservative, Hobbyist, Recent Divorcee and Planning For Retirement, each of which has been previously defined by a set of criteria, such as described above, with respect to various categories.
Thus, the Conservative interest may be defined by a relative frequency of accessing pages which are categorized in categories deemed to be associated with conservative ideas or beliefs; the Recent Divorcee interest may be defined by comparative frequency (to identify most current behaviors) of viewing web content related to divorce attorneys. Such a set of interests is stored in the database 720 and applied by the profiler 726 to a web visitor's data. The query associated with each interest is applied (as a predicate) and the result of this predicate evaluation is a Boolean value. From this processing, a set of results would flow, for example:
INTEREST  Sports Enthusiast  Conservative  Hobbyist  Recent Divorcee  Planning For Retirement
RESULT    YES                YES           NO        NO               YES
Note that here the results are Boolean values, indicating whether or not the visitor had the interest. In an alternative embodiment using fuzzy set membership, each interest result may be expressed as a measure of the degree to which the user has the interest (e.g., a scaled value between 0.0 and 1.0).

Based on a result such as this example, the web profile of this web visitor is then updated 723. Preferably, a web profile summary record in the database 720 lists the interests of the web visitor. In one embodiment, the web profile summary record contains an interest field which lists the interests of the web visitor, as determined by the profiler 726. After profiling completes, this interest field is updated. Each interest is associated with an interest identifier, and so it is actually a sequence of integers that is assigned to this interest field, such as {101, 321, 19}. For example, if the SportsEnthusiast interest has an ID of 101, the Conservative interest has an ID of 321, and the PlanningForRetirement interest has an ID of 19, then this means the same thing as {SportsEnthusiast, Conservative, PlanningForRetirement}. Each such interest ID thus concisely identifies an interest for that web visitor.

Interests are useful because they help categorize web visitors. However, interests are distinct from categories in several ways. First, interests describe users or groups of users, whereas categories describe web content. Second, interests are formed from combinations of multiple factors, including category scoring of visited web content, demographics, and the like, and thus interests are not easily constrained to a hierarchical parent-child relationship, as typified by the categories of the content recognition engine 718.

As ProReach profiles web visitors, it computes the interests of each web visitor, and then recomputes them as needed. When this computation is performed, the updated profile summary is then stored 722 back in the database 720 via database agent 728.
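To make the predicate view of interests concrete, the sketch below models the three styles of definition (absolute, relative, comparative) as Boolean predicates over a profile and collects the IDs of the interests that hold. The profile layout (page views per category), the thresholds, the sample data and the choice of a comparative definition for PlanningForRetirement are assumptions made for illustration; only the interest IDs 101, 321 and 19 come from the text.

```python
from typing import Callable, Dict, List

Profile = Dict[str, int]   # hypothetical layout: category -> page views
Predicate = Callable[[Profile, List[Profile]], bool]


def absolute_interest(category: str, min_views: int) -> Predicate:
    # At least `min_views` views of the category.
    return lambda profile, population: profile.get(category, 0) >= min_views


def relative_interest(category: str, min_share: float) -> Predicate:
    # Category makes up at least `min_share` of the visitor's own views.
    def predicate(profile: Profile, population: List[Profile]) -> bool:
        total = sum(profile.values())
        return total > 0 and profile.get(category, 0) / total >= min_share
    return predicate


def comparative_interest(category: str, factor: float) -> Predicate:
    # Visitor views the category more than `factor` times the population mean.
    def predicate(profile: Profile, population: List[Profile]) -> bool:
        if not population:
            return False
        mean = sum(p.get(category, 0) for p in population) / len(population)
        return profile.get(category, 0) > factor * mean
    return predicate


INTERESTS = {
    101: ("SportsEnthusiast", absolute_interest("sports", 20)),
    321: ("Conservative", relative_interest("conservative", 0.10)),
    19: ("PlanningForRetirement", comparative_interest("retirement", 2.0)),
}


def profile_interests(profile: Profile, population: List[Profile]) -> List[int]:
    """Return the IDs of every active interest whose predicate is true."""
    return [iid for iid, (_, pred) in INTERESTS.items() if pred(profile, population)]


visitor = {"sports": 25, "conservative": 15, "retirement": 10, "news": 50}
population = [visitor, {"sports": 2, "retirement": 1}, {"news": 30}]
interest_field = profile_interests(visitor, population)
```

For the sample visitor, all three predicates hold, so the interest field comes out as the ID sequence 101, 321, 19. A fuzzy variant could return a value between 0.0 and 1.0 from each predicate instead of a Boolean.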
The result is an updated web profile, with all the data relating to categories, and with all the interests of that web visitor updated as well. Other ProReach tools, such as the query tools, can use this data to quickly pinpoint groups of ProReach web visitors. For example, a query can be made to identify all web visitors who are both "sports enthusiasts" and "conservative." Alternatively, a query could be made to identify all web visitors who are "sports enthusiasts" but who are not "conservative." At this point, we have shown how interests are defined and how profiles are updated to reflect the web visitor's current set of interests. FIG. 7c indicates how the web server 102 can access the web profile for any web visitor. The profile servlet 730 on the web server 102 fetches 731 the web profile of any known web visitor based on a GID, which is obtained either from a cookie resident on the web client 106, or from the global identifier service. It is this ability that makes it desirable to identify the GID of the web visitor. Once the web server 102 has access to the visitor's GID, it can use it to selectively fetch data from the web visitor's corresponding profile. Given the interests in the profile, the web server 102 can dynamically compose a web page so as to maximize the content that would be of greatest interest to the web visitor, for example, by selecting content that most closely matches the categories that the visitor is interested in. ProReach has many other capabilities. It supports the tracking of web activities from the web client. It supports the exchange of web profile data between ProReach systems. It supports facilities helping web marketers identify and contact prospects. It supports advanced categorization techniques that allow businesses in vertical markets to create categories suited to their business. It also supports categorization techniques that automate the process of developing and maintaining categories. B.
Category Discovery And Maintenance This section introduces ProReach's processes for category discovery and category maintenance. We will describe these processes by example. 1. Category Discovery Suppose a ProReach system 100 has the following categories for computer peripherals, as managed by its content recognition engine 718:
Category          Number of Documents
Storage device    500
CD Rom            80
Hard drives       200
Zip drives        40
Floppy drives     100
The Storage Device category is the parent category for the other categories. First, it should be noted that the total number of documents in the subcategories is 420 (80+200+40+100), whereas there are 500 documents categorized as Storage Device documents. This suggests that there is some other category in these documents that is related to storage, but which is distinct from the existing subcategories. The category discovery process uses statistical analysis to look for the hidden categories in some existing category. As will be further described below, category discovery identifies categories based on frequency and relationships between words appearing in a set of documents. In the example above, this category discovery process might find that many storage documents were about DVDs. It would then identify "DVDs" as a potential new category. In one embodiment, the category discovery process does not automatically create a new category. Instead, any category change suggested by the category discovery process is checked and confirmed by an operator. This interaction with the operator is desirable because the category discovery process may make many valuable suggestions, but it may not always be right. Some degree of human guidance is useful to ensure that only meaningful categories get added. Suppose in the above case that the operator confirmed that a new DVD category should be added. Once confirmation is given, the rest of the process is automatic; the category can then be used immediately by the content recognition engine 718 to categorize documents. Existing documents may also be re-evaluated to determine their category score. One issue in applying the category discovery process is deciding when a search for new categories should take place.
In one embodiment a search for new categories takes place when any of the following are true: there are a large number of documents categorized within a given category (e.g., more than a predetermined number or percentage of all categorized documents); or there are signs of a missing category (e.g., a parent category having more than a predetermined number or percentage of documents relative to its subcategories); or there are a large number of web visitors accessing the documents with a given category (e.g., more than a predetermined number or percentage of visitors within a selected time period). Also, some branches of the category tree will likely exhibit more volatility over time (e.g., high technology). Hence, the historic volatility of that section of the category tree may also be a factor. 2. Category Maintenance Category discovery pertains to discovering new categories. Category maintenance pertains to maintaining and improving existing categories. As with category discovery, the process of category maintenance is preferably an advisory process, which suggests changes to the categories. It does not execute those changes unless confirmation is given; alternatively, the changes may be automatically implemented. In particular, category maintenance provides suggestions for: removing a category; and altering the training documents related to a category. Like category discovery, category maintenance involves statistical analysis. For example, a suggestion to remove a category might be made if there are very few web pages concerning this topic and there are very few people looking at such documents. Few documents and few viewers of them suggest that the category is a candidate for deletion. As another example, training documents are selected based on scoring; if the category scores are below a threshold, the training documents are reselected. Categories are moved when the keywords associated with the category are not scoring sufficiently high.
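The three triggers for a category discovery search listed above can be sketched as a simple predicate. This is a minimal illustration only: the `CategoryStats` record, the function name, and the 10% default thresholds are assumptions made for the example, since the text specifies only "a predetermined number or percentage".

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    """Hypothetical per-category counters (names are assumptions)."""
    name: str
    doc_count: int        # documents categorized under this category
    child_doc_count: int  # documents categorized under its subcategories, summed
    visitor_count: int    # visitors accessing its documents in the selected period


def should_search_for_new_categories(stats, total_docs, total_visitors,
                                     doc_share=0.10, gap_share=0.10,
                                     visitor_share=0.10):
    """Return True if any of the three discovery triggers fires.

    The *_share thresholds are placeholder percentages, not values
    taken from the specification.
    """
    # Trigger 1: a large share of all categorized documents fall in this category.
    many_docs = stats.doc_count > doc_share * total_docs
    # Trigger 2: the parent holds noticeably more documents than its subcategories.
    gap = stats.doc_count - stats.child_doc_count
    missing_category = stats.doc_count > 0 and gap > gap_share * stats.doc_count
    # Trigger 3: a large share of visitors access documents in this category.
    popular = stats.visitor_count > visitor_share * total_visitors
    return many_docs or missing_category or popular


# The Storage Device example: 500 parent documents, 420 in subcategories.
storage = CategoryStats("Storage device", doc_count=500,
                        child_doc_count=420, visitor_count=0)
```

With these placeholder thresholds, the 80 uncategorized Storage Device documents exceed 10% of the parent's 500 documents, so the second trigger alone would start a search.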
To create a category: select the category; select training documents; score the training documents to generate keywords; and apply human judgment as to whether the keywords are reflective of the category. IV. ProReach Systems With Alliances FIGS. 1-6 show how ProReach spokes 202 feed web activity data to a central hub 204 of the ProReach system 100. This hub-and-spoke topology handles one or more web servers 102 in a flexible and scalable fashion. ProReach, however, goes beyond this local accumulation of web events. Profiles of visitors maintained on a hub 204 are valuable, but the value of the information increases via aggregation across multiple hubs and ProReach systems 100. This aggregation can be accomplished by the merging of profiles from multiple sources, even when these sources of information belong to separate companies. In existing systems, companies that might benefit from the sharing of visitor profile information are reluctant to do so for several reasons. There is no infrastructure to facilitate this sharing, so sharing the information would require a huge initial outlay of software support. There are also ownership and use issues with respect to the profile information itself: which companies own the profile information, and who decides? In the present invention, alliances are a means of facilitating the sharing of profile information between businesses, and overcoming these barriers to sharing. By doing so, ProReach enables business-to-business sharing of data that is mutually beneficial to the business parties. In many cases, alliances are formed to service the businesses clustered around some vertical market. For example, there might be an alliance for pharmaceuticals, or there might be an alliance for oil-related businesses. Referring to FIG. 8, each ProReach system 100 would be a member of zero, one or more alliances 800. Membership in an alliance is voluntary. The members of an alliance 800 send copies of their profile data to the alliance 800.
This data is then aggregated into an alliance profile. An alliance profile is an aggregation of the profiles collected from the alliance members. Of course, the same web visitor may visit multiple ProReach systems 100 that are members of the same alliance 800. When different local hubs send profiles for the same web visitor, the alliance 800 can take these separate local profiles and assemble them together into a single alliance profile for that web visitor. Using the GID, the alliance can easily compute which profiles belong to the same web visitor, and correctly merge the information in these profiles to avoid duplication. In exchange for providing their local profile information to the alliance, the members of the alliance 800 get some degree of access to the alliance profiles. A ProReach system 100 can be a full access, limited access or minimum access member of an alliance 800. The responsibilities and rewards of each membership level vary. A full access member gets the maximum allowed access to vertical profiles. It must also provide a maximum amount of information from its local profiles. A limited access member gets a moderate degree of access. It must provide a moderate amount of information from its local profiles. A minimum access member gets the least amount of access to vertical profiles. It is required to provide a minimal amount of profile information from its local profiles. Participation in a vertical alliance allows each member controlled access to the jointly produced alliance profiles. Rewards and responsibilities are rationalized through the small number of membership levels. Memberships have to specify what categories of information they will provide and in what volume, and for what kind of web visitor. Hence this scheme provides a credible incentive for individual ProReach systems 100 to participate in various alliances.
ProReach systems 100 benefit from being members of an alliance by having access to the alliance profiles of the web visitors. Because the alliance profiles are aggregated over multiple web sites and ProReach systems 100, they provide a more accurate and comprehensive assessment of the interests of the web visitor. This in turn allows a given ProReach system 100 to more accurately target web content to the web visitor when the visitor visits the ProReach system 100 that is an alliance member. V. Aggregation In this section we describe in detail one embodiment of the process by which web events are aggregated by aggregation system 724 in conjunction with the aggregation queue 722. The aggregation queue 722 stores a set of web event records that are unconverted. These records are written to the queue 722 by the event processor 712 on the spoke 204, in the order in which they are received, that is, as they come in from one or more spokes. Overall, the queue will store the web events generated by many different users over some time period. Referring to FIG. 9, there is shown the logical structure of the aggregation queue 722. The aggregation queue 722 stores a collection of web events 900, each of which represents an instance of some visitor interacting with an item of web content. Each web event 900 contains a user identifier 902 (preferably the GID), a start time 904 of when the web activity began, a duration (in seconds) 906 of the activity (if the duration is not provided, the default is 1 minute), a type (representing either a transaction, a clickthrough or a page view), a URL (the domain name of the web site) and a category vector 908. The category vector 908 includes a list 910 of category identifiers and their respective category scores. Each category score indicates the degree to which the web content is evaluated by the content recognition engine 718 to be about the category.
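The web event record described above can be sketched as a small data structure. This is an illustrative sketch only: the class and field names, the `EventType` enumeration, and the example identifier are assumptions; the text itself specifies only the logical fields and the 1-minute default duration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Tuple


class EventType(Enum):
    """The three event types named in the text."""
    TRANSACTION = "transaction"
    CLICKTHROUGH = "clickthrough"
    PAGE_VIEW = "page_view"


DEFAULT_DURATION_SECONDS = 60  # default of 1 minute when no duration is provided


@dataclass
class WebEvent:
    """One visitor interaction with an item of web content (hypothetical layout)."""
    user_id: str                 # preferably the GID
    start_time: datetime         # onset of the activity
    event_type: EventType
    url: str                     # domain name of the web site
    # Category vector: (category ID, category score) pairs.
    category_vector: List[Tuple[int, int]] = field(default_factory=list)
    duration_seconds: int = DEFAULT_DURATION_SECONDS


event = WebEvent(
    user_id="gid-0001",  # made-up identifier for illustration
    start_time=datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc),
    event_type=EventType.PAGE_VIEW,
    url="example.com",
    category_vector=[(17, 200000), (42, 950000)],
)
```

Because no duration is passed here, the event takes the 60-second default.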
Preferably, there is a category score for each category stored by the content recognition engine 718. Thus, for example, if there are 1,000 categories used by the content recognition engine 718, then the vector 908 contains 1,000 <category ID, score> tuples. In one embodiment, the category scores are in a range from 0 to 1,000,000, but any useful range can be used with the appropriate scaling factors. Referring now to FIG. 10, there is shown an illustration of the components of the aggregation system 724. The aggregation system 724 is generally responsible for various types of services. First, a Daily Aggregation System 919 is responsible for generating daily aggregates from the web events that occur on the web server 102. Second, a Dimensional Aggregation System 941 is responsible for combining the daily aggregates by dimensional combining into the various User and Category complexes illustrated in FIG. 2. Third, a User Group System 950 is responsible for defining and maintaining definitions of user groups. Fourth, a Profile Service 955 is responsible for maintaining individual user profiles, and responding to queries regarding these aspects. All these services are within the scope of the aggregation system 724. The Daily Aggregation System comprises a Handler object 920, a Calculus object 922, a Parser object 924, and an Aggregator object 926. The aggregation queue 722 is also best understood as being an entry point to the Daily Aggregation System 724 (and was illustrated separately in FIGS. 7a-7d for convenience). An Event Dispatcher 930 monitors all the activities within all the services of the Aggregation System, and fires events to whoever is interested in listening to them. The Event Dispatcher is not part of the services within the Aggregation System. It simply watches all the activities going on inside the Aggregation System, like a camera.
The Daily Query object 932 is part of the Daily Aggregation System and is responsible for all queries concerning daily aggregates. The Daily Query object handles all types of queries regarding interests of users, as described above, including defining interests, and identifying users having particular interests (on daily basis). Queries are processed by a query language interpreter 944, which uses a query language 946. The handler 920 exports the interface of the Daily Aggregation System, and manages the remaining components of the daily aggregation service during the daily aggregation process of packets of web events. The Combiner 938 is part of the Dimensional Aggregation System and is responsible for doing dimensional aggregation as scheduled by member of ProReach. More particularly, the Combiner 938 is responsible for the dimensional combining of the daily aggregated web events (or of the complexes) into higher level summaries (e.g., across times, users, group, and categories), such as illustrated in Levels 1-4 of FIG. 2, according to scheduled tasks done by some members. The update object 940 is responsible for updating the Daily Aggregate whenever the Daily Aggregation System processes a packet of web events. The database 720 stores the aggregated information from the web events in a number of different tables. These are as follows: User Table: This table stores information identifying and describing each user. The fields of this table include: userID, last name, first name, this table is indexed by userID. UserID Contact Table: This table contains the following columns regarding the contact address: userID, address, address2, city, state_prov, zipcode, country, and e-mail. Demographic Table: This table contains demographic information about users. It contains the following columns: userID, gender, age, education, job. 
Members Table: This table contains information about the members of ProReach System, that is the people (or companies) that have an account with ProReach System. This table contains the following columns: ID#, lastname, firstname, e-mail, login, password, URL, account type. The URL represents the domain name of the web site owned by the member. If the member does not own a web site, the URL column will be empty. The account_type represents the type of account the member has. According to this type, the member will have access to certain services and other services might be denied. Categories Table: This table stores all of the categories used by the content recognition engine 718. The table includes the fields: categoryID, category name, and parent categoryID. The table is indexed by categoryID, and secondary indices on name and parent. The parent categoryID is used to construct a hierarchy of categories, and is further used to aggregate low level category information into higher categories. Daily Aggregate Table: Each row in this tables stores daily aggregate objects for a specific user-category combination that occurred on a given day. This information corresponds to the data at Level 0 of the Aggregation Tree shown in FIG. 2. The fields include: userID, categoryID, weight, Deviation, Day, and Trend. Deviation stores a standard deviation of the category weight over the given time period for the specified (by category ID) category. Day stores a date or day number. Trend stores a string or encoded value that describes the shape or slope of a curve of the user's interest of the time period. For example, and as will be further explained below, the trend may describe the curve as "increasing then decreasing", or as "constant then increasing". User Group Table: This table identifies each of the user groups, along with their size and a description of what the user group is about, or what are the rules for defining membership. 
The fields include: user groupID, group name, description, and size. Size indicates the number of group members. Criterion Table: This table stores the rules which may be used to define various membership tests for any of the user groups. It is used in conjunction with the user group criterion table, below. The fields include: Criterion ID: identifies the rule number. CategoryID: identifies the category to which the criterion is applied. Minimum: defines the minimum weight a user can have to satisfy the rule. Maximum: defines the maximum weight that satisfies the rule. Negation: specifies whether satisfying the rule results in group inclusion or exclusion. Example: Assume that a rule has minimum=20 and maximum=80 and that negation="No." This membership rule means: "for a user to satisfy the membership test, his/her weight for the category must be between 20 and 80." If negation="Yes," then the weight must not be between 20 and 80 in order for the user to be a member of the group under this rule. User Group Criterion Table: This table associates each user group with one or more of the membership rules defined in the criterion table. The fields include: user group ID, and criterion ID. Maintained Categories Table: This table contains the set of categories for which information (such as weight, user groups, profiles, and so forth) will be maintained. The fields include: Category ID, CurrentValue, Permanent, LowInterested, MediumInterested, HighInterested, and VeryHighInterested. This table allows the system administrator or a marketer to choose which categories will be maintained and which categories will be disregarded. This choice can be either absolute or dynamic. In the absolute case, the marketer simply chooses a collection of categories once and for all and maintains information only about these categories. In the dynamic case, the marketer considers all categories on the same footing, giving each category a certain rank in the CurrentValue field.
The CurrentValue rank can change dynamically according to how many users are interested in the category. If, for example, the CurrentValue drops under a certain level, then the category will be disregarded and removed from the table. If a new category acquires a degree of importance, then it can be added to the table. This is the dynamic case. The marketer can even combine both the dynamic and absolute cases. For example, the marketer can choose a certain number of categories to be Permanent (Boolean flag), and other categories to be dynamic rather than permanent. The permanent categories will always stay in the table, and information related to them (through user groups, profiles, etc.) will always be maintained. The dynamic categories are categories that can be removed from this table whenever their CurrentValue is under a certain level. The threshold is preferably defined by a configuration file for the aggregation system 724 or by a system administrator. The other columns of the table, LowInterested, MediumInterested, HighInterested, and VeryHighInterested, contain the number of users whose interest in the category is low, medium, high, and very high, as determined by their weights. In one embodiment, these interest groupings are associated with weight quartiles: if the weight is between 1 and 24, the interest is low (hence the user is counted under "LowInterested"); if the weight is between 25 and 49, the interest is medium; if the weight is between 50 and 74, the interest is high; and if the weight is between 75 and 100, the interest is very high. Maintained Users Table: This table lists all of the users for which profiles will be maintained. The fields include user ID, Rank, and HotCategoryID. The Rank field is a value that can change according to the importance of the user. If this value is under a certain level (e.g., below the 100th or 1000th rank), the user will be removed from the table and no profile will be maintained on this user.
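The membership rule of the Criterion Table and the weight quartiles of the Maintained Categories Table, both described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the minimum and maximum bounds are treated as inclusive (the text says only "between"), and weights below 1 are reported as having no interest level because the text does not assign them to a grouping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Criterion:
    """One membership rule (field names mirror the Criterion Table)."""
    criterion_id: int
    category_id: int
    minimum: float
    maximum: float
    negation: bool = False  # True corresponds to negation="Yes"

    def is_satisfied(self, weight: float) -> bool:
        # Assumption: bounds are inclusive.
        in_range = self.minimum <= weight <= self.maximum
        return not in_range if self.negation else in_range


def interest_level(weight: float) -> Optional[str]:
    """Map a 0-100 weight to the quartile grouping described in the text."""
    if weight < 1 or weight > 100:
        return None  # outside the ranges the text defines
    if weight < 25:
        return "low"
    if weight < 50:
        return "medium"
    if weight < 75:
        return "high"
    return "very_high"


rule = Criterion(criterion_id=1, category_id=7, minimum=20, maximum=80)
negated = Criterion(criterion_id=2, category_id=7, minimum=20, maximum=80,
                    negation=True)
```

A user with a weight of 50 for category 7 would satisfy `rule` but not `negated`, and would be counted under the "high" interest grouping.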
If, however, a new user becomes very important, then this user will be added to this table and a profile will be maintained for the user. HotCategoryID identifies the category which has the highest category weight for this user. Profile Table: This table describes each user's profile in terms of which user groups the user is a member of. The fields include: user ID, user group ID, Member Since, Membership Ended, Current Member, and Last Update. Member Since identifies the date that the user first became a member of the user group. A user can be a member of many user groups, and this membership is dynamic and changes over time. The profile table keeps a history record of user group membership. For every user group, the profile table indicates when the user first became a member (Member Since), whether he/she is still a member (Current Member), and when the membership ended (Membership Ended). From this history record of changes between different user groups, one can derive behavior patterns that can be used to predict user reactions in the future, and use this information for marketing purposes. User-Category Complex Table: This table stores the data for the UC (User-Category) complexes 203 described for FIG. 2. The fields include: user ID, category ID, weight, deviation, weight against categories, weight against population, trend, from and to. User ID and category ID define the respective user-category combination. Weight: describes the average weight of the user's interest in the category specified by category ID. Deviation: the standard deviation for this average. Weight against categories: stores a measure of how important the specified category is for the user relative to other categories. In one embodiment, the value of WeightAgainstCategories is the percentage of the totaled categories weights for the specified category.
That is, WeightAgainstCategories for category j is equal to the weight of category j divided by the sum of all category weights, and then multiplied by 100 to create a percentage (though raw decimal value may also be used). Weight against population: stores a measure of how important the specified category is for the user relative to all other users. In one embodiment, the value of WeightAgainstPopulation is the percentage of the totaled categories weights for the specified category relative to all other users. That is, WeightAgainstPopulation for category j and user k is equal to the weight of category j for user k divided by the sum of category weights for category j for all users, and then multiplied by 100 to create a percentage (though raw decimal value may also be used). Trend: describes the shape or slope of the user's interest in the category over the time period defined by From and To. From and To: define the earliest and latest start time of web activity used to generate this complex. User Complex Table: This table stores the contents of the U (User Category) complexes 205. The fields include user ID, weight, deviation, trend, from and to, and categories Count. Since a user complex summarizes the user's interest over many categories, Categories Count tracks the number of categories that interest the user. The number also is the number of children of the user complex object in the aggregation tree. The Categories Count value is used in incremental updating of the weights. When a new user-category complex 207 is formed (i.e., a new child of a user-complex) with a new weight w, then the new weight of the User complex is incremented as follows: new weight (UComplex)=([categoriesCount*old weight(UComplex)]+w)/(categoriesCount+1) Category Complex Table: This table stores the data for the C (Category) complexes 205 described in FIG. 2. 
The fields include: category ID, Weight, Deviation, Trend, From and To. As this complex summarizes over multiple users, the weight and deviation are with respect to all users over the time period defined by From and To. Group Category Complex Table: This table stores the contents of the GC (Group Category) complexes 207. The fields include user group ID, category ID, weight, deviation, trend, from and to, and users Count. Users Count tracks the number of users in this group with respect to the selected category. Group Complex Table: This table stores the contents of Group complexes 209, that is, group summaries across all categories. The fields include user group ID, Weight, Deviation, Trend, From and To, and user Count. The user count is used to update the weight for a group during incremental aggregation as follows: new weight(GComplex)=((usersCount*old weight(GComplex))+w)/(usersCount+1) where w is the weight of the newly added member of the user group. Total Complex Table: Finally, this table stores the overall Total complex 211. Every row corresponds to a total complex 211 for a defined period of time. The fields include: Start Date, LengthDays, LengthWeeks, LengthMonths, LengthYears, weight, deviation, trend, and usergroup Count. The various length fields define the time interval over which the aggregation is performed for a particular complex. The user group count contains the total number of user groups over which the total is aggregated. As with the other counts, this is used during incremental aggregation: new weight(TComplex)=((usergroupCount*old weight(TComplex))+w)/(usergroupCount+1) where w is the weight of a new user group complex 209 being added to the total complex. We now describe the process of aggregating web events. A. Aggregating Daily Web Events The scheduler 934 is responsible for initiating various processes for aggregating web events into aggregated information for various periods of time.
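The incremental weight updates given above for the User, Group, and Total complexes all share the same running-average form, which can be sketched as one helper. The function name and the returned tuple are assumptions made for illustration; only the formula comes from the text.

```python
def incremental_weight(old_weight: float, count: int, w: float):
    """Fold one new child weight w into a complex's average weight.

    Implements new = (count * old + w) / (count + 1), where count is the
    number of children already summarized (categoriesCount, usersCount,
    or usergroupCount). Returns the new weight and the updated count.
    """
    new_weight = (count * old_weight + w) / (count + 1)
    return new_weight, count + 1


# A group complex averaging 40.0 over 3 users gains a member with weight 80.0.
new_weight, new_count = incremental_weight(40.0, 3, 80.0)
```

Here (3*40.0+80.0)/4 gives a new weight of 50.0 over 4 users, without revisiting the three original members. This is why each complex table stores a count alongside its weight.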
Accordingly, on at least a daily basis, the scheduler 934 invokes the handler 920 to aggregate web events from the aggregation queue 722 into daily aggregated events, as shown in Level 0 of FIG. 2. The handler 920 requests and receives a set of web events from the aggregation queue 722 for a given day. The queue 722 keeps track of which events have been retrieved, and provides, in response to a handler request, those events which have not been processed, assembling the events that correspond to the desired day. The Aggregation System does the combining using two subsystems. A first subsystem is responsible for generating the daily aggregates from the web events (the web events are called user hits in the terminology of the Aggregation System). The second subsystem is responsible for generating the higher levels of aggregation (aggregation over weeks, months, quarters, or years, across categories, across users, across user groups), that is, the dimensional combining. The Daily Aggregation Service operates as follows: 1. The Handler object takes a packet of web events from the Aggregation Queue. 2. The Handler sends the packet to the Calculus object to compute the weights of the web events and to scale them from 0 to 100. Consider a very simple example. Suppose that the packet contains only two web events A and B. Web event A contains only one category C1 with a score of 200 and a duration of 4 minutes. Web event B contains one category C2 with a score of 300 and a duration of 2 minutes. First, the Calculus object computes the weight for the category C1 in web event A: weight(C1)=score(C1)*duration=200*4=800. Since there are no other categories in web event A, we go to the next web event B to compute the weight for the category C2: weight(C2)=score(C2)*duration=300*2=600. Since there are no other categories in web event B, we have finished computing the weights.
Now we need to scale the numbers we have just computed, namely 800 and 600. Scaling consists of replacing 800 by: [800/(800+600)]*100=57.14% and replacing 600 by: [600/(800+600)]*100=42.86%. Now, if the userID in web event A and in web event B are the same, and category C1 and category C2 are also the same, then the Aggregator object will average the two weights, (57.14%+42.86%)/2, and keep the average. If the two web events A and B have different userIDs or different categories, then we do not average, and we keep the two weights 57.14% and 42.86%. In any case, inside the DailyAggregate object, every pair (userID, category) has only one number between 0 and 100 (a percentage number) that we call the weight of the pair (userID, category). If (within a single packet of web events) one (userID, category) pair has many percentage numbers (i.e., many weights), then we average them (this is done by the Aggregator object when the Parser gives the hash map to the Aggregator, as described next). 3. The Calculus object returns the packet (of web events, where the scores are now weights that are scaled) to the Handler object and the Handler gives it to the Parser object. The Parser object transforms the data structure of the packet (from a vector to a hash map) and gives the hash map to the Aggregator object. 4. The Aggregator object computes certain quantities such as the mean, the deviation, the trend and the time interval (from, to). The Aggregator object uses the services of the Calculus object to compute these quantities. After computing these quantities, the Aggregator object calls the update methods of the Update object. The Update object has many methods (all of which start with the word update), each with its own purpose. For example, the method updateDailyAggregate( ) will update the values in the DailyAggregate object using incremental aggregation from the new hash map that was produced by the Aggregator.
The method updateUCComplex( ) updates the values of all UCComplex objects using incremental aggregation from what has changed in level 0 of the aggregation tree, etc. That is, the dimensional aggregation is automatically done (incrementally) just after the Aggregator finishes processing one packet of web events. So the Update object provides data access between the two systems, the Daily Aggregation System and the Dimensional Aggregation System. Whenever the Daily Aggregation System finishes processing a packet of web events, the Update object starts the Dimensional Aggregation (incrementally) based on what has changed at level 0 of the aggregation tree due to the processing of a new packet of web events. There is another aspect of the dimensional aggregation that is scheduled. We have just said that the dimensional aggregation starts automatically (and incrementally) each time the daily aggregation system finishes processing a single packet of web events. Let us explain why we also use a scheduled dimensional aggregation. When the ProReach System is running, it will have some members. A member is a person or a company that has an account with the central ProReach System. Let's say User A is a member. User A will have a login name, a password, and an ID number that is assigned to User A by the ProReach System (when User A subscribes for the first time). When User A wants to use the services offered by the ProReach System, he first goes to the web page of the central ProReach System and logs in using his login name and password. Once he logs in, he can use the services. Here is a short list of the services that he can use: a. Issue queries (on the web page); the answers to the queries will be shown on the web page. Queries can be on profiles, user groups, interest in some categories, etc. b. Create a user group and set the membership rules to be satisfied in order for a user to be added to the user group User A has created.
User A can schedule when to update the members of each user group, when to add new members, and how long he would like to keep each user group in the database.
c. If User A owns a web site, he can have the web traffic of his web site sent to the central ProReach System, so that ProReach can do aggregation for the web events of his site and keep the results of the analysis in ProReach's database, ready for him to query at any time.
These are only examples of the services that can be offered by the ProReach System through the web. Each service has a certain fee. There are different types of accounts. Some accounts provide users with a certain set of services, and other accounts may provide users with a larger set of services. For example, consider the case of a person (or company) that owns a web site and uses the last service of the list above (that is, service c.). Such a person has the right to choose when to do dimensional aggregation (for the web events of his/her web site) and for what time interval. Such a person can schedule these tasks from his/her account. This is what we call the scheduled dimensional aggregation tasks. This is different from the dimensional aggregation that is done automatically each time the Daily Aggregation System finishes processing a single packet of web events.
1. Transform Category Scores to Weights
The handler 920 first invokes the math package 922 to transform the category scores in each web event 900 (within a single packet of web events) into duration adjusted scores. This step normalizes the scores, and removes the need to separately store both the category scores and the duration of the event. Normalization further allows different web events to be compared as to their overall significance with respect to any category or user. The Calculus object 922 operates as follows to support this function. As noted, each web event 900 includes a vector of categories and scores.
The Calculus object 922 processes each web event 900 in turn (inside a packet of web events). For each category in the category vector of a single web event 900, the math package 922 scales each category score by the duration of the web event, and with respect to all other category scores for that web event. In one embodiment, the scaling process is as follows. First, the Calculus object 922 adjusts each score by the duration of the web event and the type of the web event:
NewScore=Score*Duration*type
where NewScore is the adjusted category score (which we will call the weight after it has been scaled from 0 to 100), Score is the original category score, Duration is the time between the start time and end time (or the duration value if directly provided; if it is not provided, the duration's default value is 1 minute), and type is a number that depends on the type of the web event. For example, if the web event is a transaction, the type would be higher than for just a clickthrough or a page view. The type of a page view is higher than the type of a clickthrough. Next, the Calculus object 922 scales the adjusted scores relative to all of the adjusted scores:
Weight=[NewScore/(NewScore_1+NewScore_2+ . . . +NewScore_n)]*100
where n is the number of categories (all the categories inside the packet of web events; a packet of web events might contain 10 web events, and each web event might contain 20 categories, so the total number of categories might be 200), and the index i of NewScore_i iterates over each category. The result of this process is that each web event 900 now contains a list of weights in place of the original category scores. The weights succinctly describe the significance of the category with respect to all other categories for that particular web event; more particularly, the weights describe each category's score as a percentage of all of the time-adjusted scores.
2. Restructure Web Event Records to Collate Category Weights by User
The handler 920 next calls the parser 924, and passes in the updated packet of web events 900.
The parser 924 restructures the packet for input into the Aggregator object 926. More particularly, the parser 924 collates the category weights of a number of web event records 900 first by user, and then by category. Referring to FIG. 11, there is an example illustration of the processing function of the parser 924. As input, the parser takes a packet of web events 900; each web event inside the packet includes, in part, the category vector 908. As described above, the web event includes a user ID 902, start time, duration, type (that is, transaction, clickthrough or page view), URL (domain name of the visited web site) and N <category, weight> pairs, where N is the number of categories. The various web events correspond to different users, and there are likely to be many web events for the same user, since each clickthrough, transaction, page view, etc. may generate a web event. Let us explain the task of the Parser object by a very simple example. Suppose that the packet of web events contains only 5 web events that we may call, for example, we1, we2, we3, we4, and we5 (we is an abbreviation for Web Event). Assume that the first, third and last web events (we1, we3, we5) all have the same userID (let's call this userID Jack). Assume further that a category C exists inside the three web events we1, we3, we5. We have three weights for the pair (Jack, C): w1, w3, w5. The first weight w1 is the weight of the category C inside the first web event we1:
w1=weight(Jack, C) inside web event we1
The second weight w3 is the weight of the same category C for the same user Jack, but inside the third web event we3:
w3=weight(Jack, C) inside web event we3
The third weight w5 is the weight of the same category C for the same user Jack, but inside the last web event we5 of the packet:
w5=weight(Jack, C) inside web event we5
The Parser object associates the sequence (w1, w3, w5) to the pair (Jack, C).
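By way of illustration only, the scaling of step 1 and the collation of step 2 might be sketched in Python as follows. The WebEvent class, the field names, and the numeric TYPE_FACTOR values are hypothetical (the description states only that a transaction ranks above a page view, which ranks above a clickthrough), and the sketch assumes that scaling is taken over all categories in the packet, following the parenthetical definition of n.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical multipliers: only the ordering
# transaction > page view > clickthrough comes from the description.
TYPE_FACTOR = {"transaction": 3.0, "page_view": 2.0, "clickthrough": 1.0}


@dataclass
class WebEvent:
    user_id: str
    start_time: float      # any sortable time value
    event_type: str        # key into TYPE_FACTOR
    scores: dict           # category -> original category score
    duration: float = 1.0  # minutes; defaults to 1 when not provided


def scale_packet(packet):
    """Return one dict per event mapping category -> weight (0 to 100)."""
    # Step 1a: NewScore = Score * Duration * type
    adjusted = [
        {c: s * ev.duration * TYPE_FACTOR[ev.event_type]
         for c, s in ev.scores.items()}
        for ev in packet
    ]
    # Step 1b: scale against the sum of all adjusted scores in the packet
    # (assumption: packet-wide normalization).
    total = sum(v for d in adjusted for v in d.values())
    if total == 0:
        return [{c: 0.0 for c in d} for d in adjusted]
    return [{c: v / total * 100.0 for c, v in d.items()} for d in adjusted]


def collate(packet, weights):
    """Step 2: group (start_time, weight) samples by (user_id, category)."""
    table = defaultdict(list)
    for ev, event_weights in zip(packet, weights):
        for category, weight in event_weights.items():
            table[(ev.user_id, category)].append((ev.start_time, weight))
    return dict(table)
```

With two clickthrough events for the same user whose adjusted scores are 800 and 600, scale_packet yields weights of about 57.14 and 42.86, and collate places both under the single key (user, category), ready for averaging by the Aggregator.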
The sequence (w1, w3, w5) is a sequence of weights for different instants of time, and it represents a curve (a function of time that measures the interest of the user Jack in the category C). This function is given only by this sequence (w1, w3, w5), and is thus a discrete function. Ideally, we would like to have a continuous function, because a continuous function can show us clearly what the shape of the graph is. If we know the shape of this graph (as a curve), then we know how the interest of Jack in the category C is changing with time. Since the sequence (w1, w3, w5) represents a discrete function and not a continuous function, we apply the rules of Probability theory to this discrete function in order to get some information about it. The first thing we do with this discrete function is to compute what in Probability theory is called the expectation of the random variable. In our case, this expectation is simply the average of the weights in the sequence (w1, w3, w5). This average is called the mean, and it is computed by the Aggregator object (with the help of the Calculus object). The second thing the Aggregator does is to compute the "error", or what Probability theory calls the variance of the random variable. This "error" is called the deviation. The third thing that the Aggregator object does is to determine roughly the shape of the graph of the discrete function represented by the values (w1, w3, w5). Is it the shape of an increasing curve, or a decreasing curve, or some sort of combination of the two? The shape of this curve is called the trend. Once this is done, the Aggregator object associates the data (mean, deviation, trend) to the pair (Jack, C) in some data structure (like a hash map, or a hash table, or the like). The Aggregator does all this for every pair (user, category). When the Aggregator finishes the processing, the result (which is a hash map, or hash table, or the like) forms an object that we call DailyAggregate.
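As a minimal sketch of the mean and deviation computations just described, the following Python fragment reduces each (user, category) sequence to a small summary. The function names and dictionary keys are hypothetical, and the use of the population standard deviation is an assumption, since the description does not say which estimator is used. The trend is omitted here.

```python
from statistics import fmean, pstdev


def summarize(samples):
    """Summarize one (user, category) pair.

    samples is a list of (start_time, weight) tuples.
    """
    ordered = sorted(samples)          # order by start time
    weights = [w for _, w in ordered]
    return {
        "mean": fmean(weights),
        # Assumption: population standard deviation of the weights.
        "deviation": pstdev(weights),
        "from": ordered[0][0],         # earliest start time
        "to": ordered[-1][0],          # latest start time
    }


def daily_aggregate(table):
    """Map every (user, category) pair to its summary."""
    return {pair: summarize(samples) for pair, samples in table.items()}
```

For the pair (Jack, C) with weights w1, w3, w5, daily_aggregate returns a single entry holding the average of the three weights, their deviation, and the time interval they cover.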
Therefore, a Daily Aggregate is an object that contains many pairs (user, category), and every pair (user, category) has associated data of the form (mean, deviation, trend). There is also a time stamp, which is the time interval that was covered by the packet of web events. In conclusion, the Daily Aggregation System processes a single packet of web events, and produces a result object that we call DailyAggregate. When the Daily Aggregation System finishes processing a packet of web events (by producing a DailyAggregate object), it goes again to the Aggregation Queue to pick up another packet of web events. The Daily Aggregation System keeps processing web events from the Aggregation Queue by packets. Now assume that we start the Daily Aggregation Service for the first time. The Daily Aggregation System goes to the Aggregation Queue and picks up the first packet of web events (packet1). After processing packet1, it produces an object (called a daily aggregate, or just aggregate for short). Let us call this aggregate agg1. Now the Daily Aggregation System goes again to the Aggregation Queue, takes the second packet of web events (packet2) and processes it. After processing packet2, it produces a second aggregate, which we can call agg2, for example. This aggregate agg2 is merged with agg1 to form only one aggregate object, which we can call agg12, for example. After fusion, the aggregates agg1 and agg2 both cease to exist, and only the aggregate agg12 exists in the database. This fusion between agg1 and agg2 is an incremental aggregation that is carried out by the Update object (through its updateDailyAggregate( ) method). The new aggregate object agg12 represents the outcome of processing a single packet of web events that is the union of the first two packets, packet1 and packet2. Daily Aggregate objects (or aggregates for short) are the data at level 0 of the Aggregation Tree illustrated in FIG. 2.
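The description does not give the formula used by updateDailyAggregate( ) to fuse agg1 and agg2. One possible approach, shown here purely as an assumption, is to keep a sample count alongside each mean and deviation so that two summaries can be combined exactly without the underlying weights. The Summary class and its count field are hypothetical and are not part of the stored model described in this document; the trend is not merged in this sketch.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Summary:
    count: int        # hypothetical: number of weights summarized
    mean: float
    deviation: float  # population standard deviation
    start: float      # "From"
    end: float        # "To"


def merge(a, b):
    """Combine two summaries as if their weights had been pooled."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    # Sum of squared deviations of the pooled data.
    m2 = (a.count * a.deviation ** 2
          + b.count * b.deviation ** 2
          + delta ** 2 * a.count * b.count / n)
    return Summary(n, mean, sqrt(m2 / n),
                   min(a.start, b.start), max(a.end, b.end))


def merge_aggregates(agg1, agg2):
    """Fuse two aggregates keyed by (user, category) into one."""
    merged = dict(agg1)
    for pair, summary in agg2.items():
        merged[pair] = merge(merged[pair], summary) if pair in merged else summary
    return merged
```

Under this assumption, merging the summary of weights (10, 20) with the summary of the single weight 30 gives the same mean and deviation as summarizing (10, 20, 30) directly, which is the property that lets agg12 stand for the union of packet1 and packet2.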
Each day is represented by a single Daily Aggregate object. The result is that for a given user associated with a number of web event records--as will typically occur during a visit to a web site, perhaps generating 20 to 100 or more web events--the category weights from the many different records are collected and collated in a single category hash table 1100, so that for each category, all of the weights and start times are packaged together. This allows all of the relevant information about the user's web activity during the day the web event records were collected to be easily accessed from a single data source.
3. Create Category Interest Time Model Information
The result of the prior step is one user-category table 1100 for each user that appeared on the web server 102 on the day being processed. With each of these user-category hash tables 1100, the handler 920 next calls the aggregation engine 926. The aggregation engine 926 processes these tables into category interest time model information for each user. The summarized information describes the particular user's interests in the various categories over the day for the collected web event records. The aggregation engine 926 operates as follows on each received user-category hash table. First, for each category table 1100, the aggregation engine 926 sorts the category's weight list 1102 by the start times. The aggregation engine 926 preferably does this by calling a sorting routine in the math package 922. The result is a set of data points, essentially a curve, which describes the user's level of interest in the category over the time period from the earliest start time to the latest start time. FIG. 12 illustrates such a category interest curve 1200, for a hypothetical "Art Deco" category. The graph shows the data of 14 web events related to this category, sorted by their starting time, and shows that the user's interest was initially very high, then declined, and then rose again.
The goal at this next stage is then to capture each category interest curve 1200 mathematically, and eliminate the need to store the underlying weight and time data of the weight list. More particularly, for each category, the aggregation engine 926 determines the expected value of the category interest curve 1200 over the time period (e.g., one day). In one embodiment, the aggregation engine 926 determines the mean weight and the standard deviation of the weights in the category for the time period. The mean weight is simply the total of all weights in the weight list 1102 for the category divided by the number of weights, which will be the number of web events for this user during the time period. The standard deviation is computed normally. Again, these computations are preferably performed by the math package 922, as requested by the aggregation engine 926. The aggregation engine 926 then creates a trend description for the category interest curve. The trend description describes the changes in the user's level of interest in the category over the time period represented by the curve. Preferably, this trend description is a string description (or its coded equivalent). To obtain this trend in one embodiment, the aggregation engine 926 first takes the difference between the weight of the earliest start time and the mean weight. This describes whether the curve is increasing, decreasing, or constant relative to the earliest start time. Next, the aggregation engine 926 takes the difference between the mean weight and the weight of the latest start time, and again determines whether the curve is decreasing, increasing or constant. Thus, there are nine possible trends:
1. Increasing, decreasing
2. Increasing, constant
3. Increasing, increasing
4. Constant, decreasing
5. Constant, constant
6. Constant, increasing
7. Decreasing, decreasing
8. Decreasing, constant
9. Decreasing, increasing.
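The two-segment trend determination of this embodiment can be sketched as follows. The function names and the tolerance tol (used to decide when two values count as "constant") are assumptions introduced only for illustration; the description does not specify how equality is tested.

```python
def direction(before, after, tol=1e-9):
    """Classify the change from one value to the next."""
    if after - before > tol:
        return "increasing"
    if before - after > tol:
        return "decreasing"
    return "constant"


def trend(samples, tol=1e-9):
    """Return the two-part trend for a list of (start_time, weight) tuples.

    The first part compares the weight at the earliest start time with the
    mean weight; the second compares the mean weight with the weight at the
    latest start time.
    """
    ordered = sorted(samples)                 # order by start time
    weights = [w for _, w in ordered]
    mean = sum(weights) / len(weights)
    return (direction(weights[0], mean, tol),
            direction(mean, weights[-1], tol))
```

A curve that starts above its mean and ends above its mean, such as one that is initially high, declines, and then rises again, is reported by this sketch as ("decreasing", "increasing"), which corresponds to trend 9 in the list above.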
The aggregation engine 926 determines the appropriate time trend, and stores information for this time trend for the category. The stored information may be the strings themselves ("increasing," "constant," and "decreasing"), or code values for these (e.g., 1=increasing, and so forth). Obviously, more than three times/two segments can be selected to result in more complex time trend descriptions. The aggregation engine 926 may apply other methods to determine the time trend of the category interest curve. In another embodiment, the aggregation engine 926 selects a number of sample times in the interval, including a point at or near the earliest start time, a point at or near the latest start time, and a number of times between these two times. Then, beginning with the first selected time, the aggregation engine 926 determines whether the curve is increasing, decreasing, or constant to the next selected time, and assigns a string or code equivalent to that portion of the curve. For example, in one embodiment, three times are selected: the earliest start time, the middle start time, and the last start time. With these three times, there are two curve segments, and the aggregation engine 926 determines whether the curve is increasing, decreasing or constant in each segment. In yet another embodiment, the aggregation engine 926 determines the time trend by identifying the times at which the slope of the category interest curve changes from positive to negative, and storing both the start time and the appropriate descriptive information about the time period being described. With the time trend information, the aggregation engine 926 now has a complete description of the user's category interest for the given day.
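The two alternative embodiments above might be sketched as shown below. This is an illustration under stated assumptions: the sample times are taken at evenly spaced positions in the sorted weight list (the description does not say how intermediate times are chosen), the slope is approximated by differences between consecutive weights, and all function names and the tolerance tol are hypothetical.

```python
def direction(before, after, tol=1e-9):
    """Classify the change from one value to the next."""
    if after - before > tol:
        return "increasing"
    if before - after > tol:
        return "decreasing"
    return "constant"


def segment_trends(samples, picks=3, tol=1e-9):
    """Describe the curve between `picks` sample times (picks >= 2).

    Assumption: sample times are evenly spaced positions in the sorted
    list, always including the earliest and the latest start time.
    """
    ordered = sorted(samples)
    last = len(ordered) - 1
    idx = [round(i * last / (picks - 1)) for i in range(picks)]
    chosen = [ordered[i][1] for i in idx]
    return [direction(a, b, tol) for a, b in zip(chosen, chosen[1:])]


def slope_sign_changes(samples):
    """Start times at which the slope turns from positive to negative."""
    ordered = sorted(samples)
    turning = []
    for (_, w0), (t1, w1), (_, w2) in zip(ordered, ordered[1:], ordered[2:]):
        if w1 > w0 and w2 < w1:
            turning.append(t1)
    return turning
```

With picks=3 the first function reproduces the earliest/middle/last selection of the example, while larger values of picks yield the more complex trend descriptions mentioned above.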
More specifically, it can store the following category time pattern model for subsequent use:
{User ID, Category ID, Mean Category Weight, Category Weight Standard Deviation, From, To, Trend}
where "From" is the earliest start time, "To" is the latest start time in the sorted weight list 1102, and Trend is the description of the curve changes (either string or encoded). The underlying category weight information from the raw web events can now be deleted, and the category time pattern model stored in the database 720 in the User-Category table. This process is repeated for each category weight list in the user-category hash table 1100.
B. Dimensional Combining.
The combiner 938 is the component that is responsible for combining the daily aggregated information into the summarized information of the various complexes. The dimensional aggregation tasks carried out by the Combiner object correspond to the tasks scheduled by some members. The automatic (incremental) dimensional aggregation that occurs all the time is carried out by the Update object. Referring again to FIG. 2, there are shown the various levels of aggregated information that are provided by ProReach, specifically those computed by the combiner 938. The combiner 938 is designed to combine any provided set of category interest time pattern information with respect to any combination of user, category, or time period. We describe the operation of the combiner 938 with respect to the various levels of aggregated information in FIG. 2. Generally, each of the aggregate complexes in FIG. 2 contains a weight value, as described with respect to each of the tables of the database 720. The weight value is computed by an aggregation function which operates on the weight values of all of the complexes which contribute to the complex being evaluated. For example, if
